REMARKS

Reconsideration of this application is respectfully requested. 

Claims 35, 37, 39, 41, 43, and 45 were rejected under 35 U.S.C. § 102(e) as
allegedly being anticipated by Chang et al. (U.S. Patent No. 6,001,977) and claims 35- 
46 were rejected under 35 U.S.C. § 103(a) as allegedly being unpatentable over Chang 
et al. in view of White et al. (U.S. Patent No. 4,677,054). Applicants' claims 35-46 recite 
methods and kits using probes comprising HIV-1 ORF-1, ORF-4, and ORF-R 
sequences. The proteins encoded by ORF-1, ORF-4, and ORF-R are now known as
the Vpr, Vpu, and Nef proteins of HIV-1. The Examiner alleges that Chang discloses
the claimed nucleic acids. The basis for the Examiner's rejection is that the difference 
between Chang's sequence and applicants' claimed sequence is within a range that can 
be attributed to sequencing errors, and that the sequences are identical since "the art 
recognizes that sequencing errors occur in a range between 0.3% and 2.5%, as 
evidenced by Richterich (Genome Research (1998) 8:251-259)." (Paper No. 32 at 12.) 
Applicants traverse the rejection. 

Richterich (Exhibit 1) does not support the Examiner's position. Richterich does 
not suggest that the difference between applicants' claimed sequence and Chang's 
sequence is due to sequencing errors. Rather, Richterich estimates sequencing errors 
in "raw" DNA sequence data. (Richterich at 251, Title.) As Richterich explains, "raw" 
DNA sequence data refers to DNA sequences from "large-scale DNA sequencing" 
projects. (Id. at 251, col. 1, ¶ 1.) These are projects where, due to mass-production,
many different DNA sequences are generated, but are not subject to any "polishing" by 
further sequencing efforts. (See id. at 251, Abstract.) In the context of these large-
scale DNA sequencing projects, any "resequencing" or "assessments" is considered
inefficient and a time delay. (Id.) Thus, the error rates of Richterich are based on these 
mass-production sequencing efforts, not on efforts to sequence a specific DNA clone 
where polishing, resequencing, and assessments are integral and essential parts of the 
sequencing effort. 

Nonetheless, even in these large-scale DNA sequencing projects, the 1.44
million sequence bases with high quality scores (i.e., good quality sequence) contained
only 237 errors, which is a very low error rate. (See id. at 252, col. 1, ¶ 1.) The
Examiner has offered no reasons why error rates of "between 0.3% and 2.5%," as 
opposed to this substantially lower error rate (approximately 0.017%), should be 
applicable to Chang's sequences. Applicants note that there is no evidence of record 
that Chang's sequences are not good quality sequences. 

Moreover, neither applicants' sequencing nor Chang's sequencing was the type
of large-scale DNA sequencing project referred to by Richterich, and neither sequence
can be considered a "raw" DNA sequence. Rather, applicants' and Chang's sequences are
"polished" sequences, since "polishing" is an integral and essential part of any 
sequencing effort. (See, e.g., Current Protocols in Molecular Biology at 7.1.1. (Exhibit 
2) and Sambrook et al. at 13.20 (Exhibit 3).) 

Chang's sequences would be expected to have much lower error rates than the
large-scale DNA sequencing projects referred to by Richterich. For example, Chang
describes the sequencing of HIV-1 DNA as follows:

Genetic engineering methods are used to determine the nucleotide 
sequence of HTLV-III DNA. One technique that can be used to 
determine the sequence is a shotgun/random sequencing method. 
HTLV-III DNA is sheared randomly into fragments of about 300-500 
bp in size. The fragments are cloned, for example, using M13, and
the colonies screened to identify those having an HTLV-III DNA 
fragment insert. The nucleotide sequence is then generated, with 
multiple analysis producing overlaps in the sequence. Both strands 
of the HTLV-III DNA are sequenced to determine orientation. 
Restriction mapping is used to check the sequencing data 
generated. 

('977 patent at 8, lines 28-39.) To assure a high quality of sequence, Chang indicates 
that the sequence is "polished" by having multiple analyses producing overlaps and 
sequencing both strands. Chang's HIV-1 sequences do not contain intact nef or vpr 
orfs. ('977 patent at Fig. 3.)

Similarly, applicants' sequences would be expected to have much lower error rates
than the large-scale DNA sequencing projects referred to by Richterich. Sequencing of 
applicants' HIV-1 clone is fully detailed in Wain-Hobson et al. (1995)(Exhibit 4). Wain- 
Hobson et al. states: "Each nucleotide was sequenced on average 5.3 times: 85% of 
the sequence was determined on both strands and the remainder was sequenced at 
least twice from independent clones." (Wain-Hobson et al. at 12, legend to Fig. 1.)
Thus, applicants' sequence is a "highly polished" sequence. Applicants' HIV-1 
sequence contains intact nef and vpr orfs. (Specification at 13 and Figs. 9, 11, and 12
and Wain-Hobson et al. at Fig. 1.) 

Sequencing the same region multiple times leads to higher accuracy. (Current 
Protocols in Molecular Biology at 7.1.1.) Also, sequencing both strands leads to higher
accuracy. (Sambrook et al. at 13.20.) This additional resequencing was not done in 
Richterich, but was considered a time delay. Thus, Richterich's error rates of "between 
0.3% and 2.5%," are not applicable to a comparison of Chang's and applicants' 
sequences, which are not "raw" DNA sequences. 
Instead, one skilled in the art would have expected that both applicants' and
Chang's "polished" sequences would have very low error rates. As Sambrook et al. 
explains: "When DNA sequencing is carried out carefully, the error rate is less than 
0.1%." (Sambrook et al. at 13.20.) There is no reason to believe that applicants' and
Chang's sequencing were not performed carefully. Thus, the skilled artisan would have 
expected error rates of less than 0.1% for applicants' and Chang's sequences. With 
error rates of less than 0.1%, sequencing errors cannot explain the differences between 
applicants' and Chang's sequences. 

Furthermore, applicants submit herewith Ratner et al. (Exhibit 5) as objective 
evidence that the differences between applicants' and Chang's sequences are real. 
Ratner et al. resequenced Chang's clone BH10 (the sequence of which is shown in 
Figure 3 of the 6,001,977 patent) from both strands. (Ratner et al. at 59, ¶ 4.) Ratner's
DNA sequence of BH10 is shown in Figure 1. (Id. at 60-61.) Similar to the sequence of 
BH10 in the '977 patent, Ratner's sequence of BH10 contains a stop codon at position 
124 in the 206 codon 3' orf gene (i.e., nef). (Id. at 61.) Similarly, Ratner's sequence of 
BH10 contains a frameshift in the vpr orf. (Id.) Consequently, Chang's BH10 clone 
does not encode applicants' Nef or Vpr proteins. Ratner et al. indicates that few if any 
of the sequence differences shown in Figure 1 are likely to represent cloning artifacts or 
sequencing errors. (Id. at 59, ¶ 4.) Consequently, Ratner et al. provides objective
evidence that contradicts the Examiner's allegation that the differences between 
applicants' and Chang's sequences are due to sequencing errors. 

Applicants were the first to identify the ORF-R (Nef) and ORF-1 (Vpr) reading 
frames. Chang was unable to identify these reading frames since Chang's sequences 
did not contain the complete nef orf because it contained a stop codon and did not
contain the complete vpr orf because it contained a frameshift. Consequently, 
applicants' claimed nucleic acids cannot be anticipated by Chang. Accordingly, 
applicants respectfully request withdrawal of the rejection. 

Applicants respectfully submit that this application is now in condition for 
allowance. In the event that the Examiner disagrees, he is invited to call the 
undersigned to discuss any outstanding issues remaining in this application in order to 
expedite prosecution. 

Please grant any extensions of time required to enter this response and charge 
any additional required fees to our deposit account 06-0916. 
Estimation of Errors in "Raw" DNA
Sequences: A Validation Study 

Peter Richterich 1 

Genome Therapeutics Corp., Waltham, Massachusetts 02154 USA 

As DNA sequencing is performed more and more in a mass-production-like manner, efficient quality control 
measures become increasingly important for process control, but so also does the ability to compare different 
methods and projects. One of the fundamental quality measures in sequencing projects is the position-specific 
error probability at all bases in each individual sequence. Accurate prediction of base-specific error rates from 
"raw" sequence data would allow immediate quality control as well as benchmarking different methods and 
projects while avoiding the inefficiencies and time delays associated with resequencing and assessments after
"finishing" a sequence. The program PHRED provides base-specific quality scores that are logarithmically
related to error probabilities. This study assessed the accuracy of PHRED's error-rate prediction by analyzing 
sequencing projects from six different large-scale sequencing laboratories. All projects used four-color 
fluorescent sequencing, but the sequencing methods used varied widely between the different projects. The 
results indicate that the error-rate predictions such as those given by PHRED can be highly accurate for a large 
variety of different sequencing methods as well as over a wide range of sequence quality. 



In DNA sequencing, knowledge about the accuracy 
of sequences can be very valuable. For example, dif- 
ferent large-scale sequencing projects may produce 
sequences at similar rates and costs but with signifi- 
cantly different error rates in the final sequence. 
One major determinant in the final error rate is the 
accuracy of the "raw" sequence. Knowledge about 
the frequency and location of errors in the raw se- 
quence data can help to direct "polishing" efforts to 
the places where additional effort is needed; it also 
enables the comparison between different sequenc- 
ing projects without requiring that the same region 
be sequenced in each project. 

Another area where estimates about sequence 
error rates would be beneficial is technology devel- 
opment. Accurate error estimates at each base would 
enable "quality benchmarking" between different 
methods, thus enabling researchers to choose the 
method that fills their needs for accuracy and 
throughput best. 

Several groups have developed mathematical 
models to predict the error probability at any given 
position in raw sequences. Lawrence and Solovyev 
used linear discriminant analysis to calculate sepa- 
rate probability estimates for insertions, deletions, 
and mismatches (Lawrence and Solovyev 1994). Ew- 
ing and Green (1998) developed the program 
PHRED, which calculates a quality score at each
base. This quality score q is logarithmically linked to
the error probability p: q = -10 × log₁₀(p) (for a
discussion of how quality scores are calculated and
what the limitations are, see Ewing et al. 1998).
When used in combination with sequence assembly 
and finishing programs that utilize these error esti- 
mates, reliable error probabilities promise to in- 
crease the accuracy of consensus sequences and to 
reduce the efforts required in the finishing phase of 
sequencing projects (Churchill and Waterman 
1992; Bonfield and Staden 1995). 
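
The relation between quality score and error probability stated above can be illustrated with a minimal Python sketch. The function names are hypothetical and chosen only for this illustration; they are not part of PHRED.

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Convert a quality score q to an error probability p = 10^(-q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Inverse mapping: q = -10 * log10(p)."""
    return -10 * math.log10(p)

if __name__ == "__main__":
    # Print the error probability for a few representative scores.
    for q in (4, 9, 10, 30):
        print(q, f"{phred_to_error_prob(q):.4f}")
```

Under this mapping a score of 10 corresponds to an error probability of 10%, and a score of 30 to 0.1%.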

To examine the accuracy of probability esti- 
mates made by the program PHRED, we compared 
the actual and predicted error rates for six different 
cosmid- or BAC-sized projects that were produced 
by six different large-scale sequencing centers in the 
United States. All of these six projects used four- 
color fluorescent sequencing machines; however, 
the DNA preparation methods, sequencing en- 
zymes, fluorescent dyes and chemistries, and gel 
lengths varied significantly between the six groups. 
Table 1 gives an overview of the sequencing projects 
analyzed. Table 2 lists the different methods used. 

RESULTS 

Error Rate Prediction Accuracy for Six Projects 

A comparison of actual and predicted error rates for 
the six projects in this study is shown in Table 3.
Table 1. Summary of Data Sets

Project    Reads    Aligned bases    Average aligned read length
A            455          416,214                            915
B           1277          871,230                            682
C           1065          603,655                            567
D            834          414,595                            497
E           1638        1,149,209                            702
F           1885          907,796                            482
Total       7154        4,362,699                            610



The results indicate that PHRED is very successful in 
identifying bases with low error probabilities. For 
example, the 1.28 million bases with quality scores 
of 4-12 (corresponding to error probabilities be- 
tween 39.8% and 6.3%) contain a total of 187,926 
errors. In contrast, the 1.44 million bases with qual- 
ity scores between 33 and 42 (corresponding to error 
probabilities between 0.05% and 0.006%) contain 
only 237 errors, which translates into a 790-fold 
lower error rate. The trend toward lower error rates 
can also be observed for each individual project. In 
most cases, the actual number of errors is close to 
the predicted error rate. It is also apparent that the 
actual error rate is typically lower than the predicted 
error rate. 
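
As a rough check of the statements above, the following Python sketch recomputes per-bin actual error rates and actual-to-expected ratios from the "All" row of Table 3. The dictionary layout and the function name are assumptions made for this illustration only.

```python
# "All" row of Table 3:
# quality-score bin -> (aligned bases, expected errors, actual errors)
TABLE3_ALL = {
    "4-12":  (1_283_087, 220_252, 187_926),
    "13-22": (767_773, 21_513, 18_103),
    "23-32": (739_007, 1_781, 1_441),
    "33-42": (1_447_118, 370, 237),
    "43-51": (543_134, 13, 12),
}

def summarize(table):
    """Return {bin: (actual error rate per base, actual/expected error ratio)}."""
    return {
        label: (actual / bases, actual / expected)
        for label, (bases, expected, actual) in table.items()
    }

if __name__ == "__main__":
    for label, (rate, ratio) in summarize(TABLE3_ALL).items():
        print(f"{label:>6}: actual rate {rate:.4%}, actual/expected {ratio:.2f}")
```

In every bin of the pooled data the actual error count is below the expected count, consistent with the observation that PHRED tends to overpredict errors slightly.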

Both the high overall accuracy and the ten- 
dency to slightly overpredict errors are confirmed 
by statistical analysis, as shown in Table 4. The cor- 
relation between predicted and actual error frequen- 
cies is excellent for all projects (Spearman correla- 
tion coefficient >0.89, P < 0.0001). Averaged over all 
projects, the actual error rate is 84.5% of the pre- 
dicted error rate; the slope of the relation between 
predicted and actual error rates differs slightly be- 
tween projects and ranges from 76.6% to 88.4%. To 
put these differences between projects in relation, it 
is worthwhile remembering that PHRED quality 
scores cover a wide dynamic range: The maximum 
quality score of 51 corresponds to a 50,000-fold 
lower predicted error rate than the minimum qual- 
ity score of 4. Even the relative difference between 
successive quality scores is larger than the relative difference
in the slopes; for example, a quality score of 10
corresponds to an error probability of 10%, whereas 
a score of 9 corresponds to an error probability of 
12.6%. 

A different way of looking at the relation between
the actual and predicted error rates is shown
in Figure 1. Here, the error rates as a function of the
position within all reads in each of the projects, averaged
over 50-base windows, are depicted. For all six
projects, the predicted error rates are very close to 
the actual error rates over the entire length of the 
sequences. Each project has a characteristic distribu- 
tion of error rates, which differs from each of the 
other projects. The minimum error rate differs dra- 
matically between projects. The best projects 
achieve raw error rates of 0.23%-0.36% in the best 
region of the sequence read, typically from base 150 
to 200. The worst project in the data set had an 
~10-fold higher error rate of 2.58%. 
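
The windowed averaging described above can be sketched as follows. This is only an illustration under stated assumptions: each read is represented as a hypothetical list of 0/1 flags (1 meaning the base disagrees with the consensus), and the function name is invented here rather than taken from the study's software.

```python
def windowed_error_rate(reads, window=50):
    """Pool per-base error flags from all reads into fixed-size windows.

    reads: iterable of lists of 0/1 flags, one list per aligned read.
    Returns a list of (first_base, error_rate) tuples, where first_base is
    the 1-based position of the first base in the window.
    """
    errors = {}
    bases = {}
    for flags in reads:
        for start in range(0, len(flags), window):
            chunk = flags[start:start + window]
            errors[start] = errors.get(start, 0) + sum(chunk)
            bases[start] = bases.get(start, 0) + len(chunk)
    return [(start + 1, errors[start] / bases[start]) for start in sorted(bases)]

if __name__ == "__main__":
    # Two toy reads: one of 100 bases without errors, one of 50 bases
    # with a single error at its first base.
    reads = [[0] * 100, [1] + [0] * 49]
    print(windowed_error_rate(reads))
```

Windows near the end of the reads are averaged over fewer reads, because shorter reads contribute no bases there.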

Toward the end of sequence reads, the error 
rates increase and start to exceed 10% between bases 
300 and 700. In projects that used mainly short gels 
(e.g., projects D and F), this increase begins sooner, 
whereas projects that use longer gels show a mark- 
edly longer stretch of low error rates (e.g., projects A 
and B). 

Table 5 summarizes key results for the six 
projects. The first four projects have similar mini- 
mum and average error rates. However, the length 
of the region where the error rate is below 5% differs 
significantly, from 403 to 682 bases. The project 
with the shorter low error rate regions contained 
larger portions of reads generated on short gels, 
whereas projects A and B were run exclusively on 
long gels (ABI373 stretch or ABI377 sequencers). 
Other factors contributing to differences between 
the first four projects were differences in sequencing 
chemistries, production scale, and electrophoresis 
conditions and machines. 

Project E and, in particular, project F, had sig- 
nificantly higher error rates than the first four 
projects. In projects E and F, every sequence gener- 
ated for the project had been included in the data 
set, whereas the other four projects had eliminated 
some "bad" sequences through manual or auto- 



Table 2. Overview of Sequencing Methods Used in the Different Projects

Template DNA              Single-stranded M13, double-stranded plasmids
Sequencing enzymes        Sequenase, Taq, KlenTaqTR, AmpliTaq FS
Sequencing chemistries    Dye primer (two different dye chemistries), dye terminator
Sequencing machines       ABI 373, ABI 373 stretch, ABI 377
Gel length                Only short gels, only long gels, mixes of short and long gels
Table 3. Comparison of Predicted and Actual Error Rates for Six Different
Sequencing Projects

                                            Quality score
Project                         4-12       13-22       23-32       33-42       43-51

A       aligned bases    [illegible] [illegible] [illegible] [illegible] [illegible]
        expected errors       20,256       2,064         172          37           1
        actual errors         16,784       1,758         127          17           1

B       aligned bases    [illegible] [illegible] [illegible] [illegible] [illegible]
        expected errors       29,953       3,704         410         102           3
        actual errors         26,038       2,536         287          35           0

C       aligned bases    [illegible] [illegible] [illegible] [illegible] [illegible]
        expected errors       22,277       3,411         357          74           2
        actual errors         16,670       1,513         194          26           3

D       aligned bases        103,898      68,995      68,613     153,730     111,752
        expected errors       16,880       1,919         168          38           3
        actual errors         14,495       1,924         146          59           2

E       aligned bases        378,755     217,438     167,968     392,717     144,313
        expected errors       63,947       6,336         418          95           4
        actual errors         55,968       6,516         355          67           5

F       aligned bases        359,809     136,688      98,840      64,035       5,130
        expected errors       66,938       4,079         256          23           0
        actual errors         57,971       3,856         332          33           1

All     aligned bases      1,283,087     767,773     739,007   1,447,118     543,134
        expected errors      220,252      21,513       1,781         370          13
        actual errors        187,926      18,103       1,441         237          12


matic inspection. After eliminating <10% of the
worst sequences in project E, the error rates for the 
remaining sequences were comparable to those of 
the first four projects. In contrast, project F showed 
a much more uniform distribution of sequence 
quality. 



The last column in Table 5 shows the average 
number of bases with an estimated error probability 
of at most 0.1%, which is equivalent to a quality 
score of at least 30. The count of such "very high- 
quality" bases is a good indicator of sequence qual- 
ity, both for individual sequences and, when aver- 



Table 4. Summary of Statistical Analysis Results

Project    Spearman ρ    P>|ρ|      Slope    F ratio    P>F
A          0.9646        <0.0001    0.818      75.1     <0.0001
B          0.9890        <0.0001    0.874      98.2     <0.0001
C          0.9846        <0.0001    0.766      71.6     <0.0001
Dᵃ         0.8692        <0.0001    0.855      68.3     <0.0001
E          0.9956        <0.0001    0.884     144.3     <0.0001
F          0.9968        <0.0001    0.865     151.6     <0.0001
All        0.9964        <0.0001    0.845     174.5     <0.0001

ᵃ In project D, the Spearman correlation coefficient ρ was artificially low as only very few bases (10) had
a quality score of 5, and none of these bases contained an actual error (expected: 3.16 errors). Exclusion of
this quality score gave a Spearman correlation coefficient of 0.9786 (P < 0.0001). The frequencies in the slope
calculations were weighted by the number of bases at any given quality score and, thus, were not sensitive to
such small sample distortions (see Methods).
Figure 1 Actual and predicted error rates in six different sequencing projects. Actual error rates and predicted
error rates in 50-base windows over the length of the sequence reads, averaged over all reads that could be aligned
to the consensus sequence by CROSS_MATCH, are shown. The numbers on the x-axis show the first base in a given
50-base window.



aged over all sequences in a project, as an indicator 
for the entire project. Compared to the estimated 
error rates, the count of very high-quality bases is 
less prone to distortions from a small number of 
low-quality reads, as the data for project E demon- 
strate. 



Prediction Accuracy for Data Subsets of Different 
Quality 

The quality of sequences within any given project 
can vary substantially, and the use of predicted error 
rates has the potential to be a powerful tool for qual- 



Table 5. Comparison of Key Results for Six Different Sequencing Projects

           Actual minimum    Actual average    Length of <1%    Length of <5%    Average bases with
Project    error rate (%)    error rate (%)    error region     error region     P(error) <0.1%
A          0.36              3.6               422              682              468
B          0.34              2.8               274              567              395
C          0.23              2.4               291              479              348
D          0.39              3.1               300              403              294
E          0.71              4.7               129              464              317
F          2.58              9.2                 0              162               79

ity analysis and control in large-scale DNA sequencing
projects. To analyze how accurate PHRED error
estimates are for different quality sequences within
the same sequencing project, we subdivided a data
set into four quartiles, based on the number of very
high-quality bases in each sequence (see Methods).
The comparison of actual and predicted error rates is 
shown in Figure 2. 
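
The quartile assignment described above could be sketched as follows. This is a minimal illustration, assuming each read is available as a list of quality scores keyed by a read name; the function names, the data layout, and the tie-breaking behavior are assumptions, not the procedure from the Methods section.

```python
def count_very_high_quality(quals, threshold=30):
    """Count bases whose quality score is at least the threshold (30 ~ P(error) <= 0.1%)."""
    return sum(1 for q in quals if q >= threshold)

def assign_quartiles(reads, threshold=30):
    """Map each read name to a quartile 1..4; quartile 1 holds the highest counts.

    reads: dict of read name -> list of quality scores.
    Ties keep their original order because sorted() is stable.
    """
    ranked = sorted(
        reads,
        key=lambda name: count_very_high_quality(reads[name], threshold),
        reverse=True,
    )
    n = len(ranked)
    return {name: (rank * 4) // n + 1 for rank, name in enumerate(ranked)}

if __name__ == "__main__":
    # Eight toy reads with 0..7 very high-quality bases each.
    reads = {f"r{i}": [40] * i + [10] * (8 - i) for i in range(8)}
    print(assign_quartiles(reads))
```

When the number of reads is not divisible by four, the integer arithmetic above yields quartiles of slightly unequal size.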

When measured by the error rate in the best 
region of a sequence, the data quality in the different
quartiles varies >100-fold between the best and
the worst 25% of the sequences. The best quartile
showed ~0.03% error for >100 bases, whereas the
error rate in the worst quartile always exceeded 5%. 
In quartiles 2 and 3, the predicted error rates match 
the actual error rates very closely. In the best and 
Figure 2 Actual and predicted error rates in different quality subsets of project
B. Sequence reads were sorted by the number of bases with a predicted error rate
of at most 0.1% (very high-quality bases), and assigned to quartiles, with quartile
1 corresponding to the highest numbers. Actual and predicted error rates for all
sequences in each subset were calculated as in Fig. 1. Note that a number of
sequence reads that had been rejected because of too low quality were added 
back to the data set for illustrative purposes, all of which are in quartile 4. These 
sequences were not included in the data sets used to generate Figs. 1 and 3 and 
Tables 1 and 3. 



worst quartiles, PHRED's accuracy was somewhat
lower from base 100 to 500. In the best sequences, 
PHRED's error estimates were about twofold too 
high; in the worst sequences, the error estimates 
were too low, again by a factor of 2. This underpre- 
diction of errors can be partially explained by the 
fact that PHRED gives ambiguous base calls (Ns) a
quality score of 4, corresponding to an error probability
of 39.8%; however, Ns will always show up
as an actual error. Even in the worst and best quar- 
tiles, however, the predicted error rate curves are 
very similar to the actual error rate curves. 

The results shown in Figure 2 also demonstrate 
that the count of very high-quality bases, or bases 
with an estimated error probability of at most 0.1%, 
can be used effectively to characterize the overall 
quality of a sequence read. 
Sorting the sequence reads 
into quartiles based on the 
number of very high-quality 
bases worked well, as shown 
by the >100-fold difference in
the minimum error rate between
the first and the fourth
quartile. 

Other methods to charac- 
terize the overall quality of in- 
dividual reads based on 
PHRED quality scores can give 
similar results. For example, 
counting bases above a mini- 
mum quality threshold any- 
where in the range of 20-40 
gave similar results for most 
data sets (not shown), and 
such counts are used by a 
number of different laborato- 
ries as quality measures. Alter- 
natively, the quality values 
can be converted to error 
probabilities and averaged to 
give the predicted error rate 
for the trace, or summed to 
give the total predicted num- 
ber of errors in a trace. How- 
ever, such averages and totals 
can sometimes give a mislead- 
ing picture, as the following 
example illustrates. Assume 
that two sequence reads have 
very similar quality in the 
alignable part of the read but 
that one of the two sequences 
was run much longer and 
Figure 3 Actual frameshift and total error rates for projects A and B. To calculate
frameshift error rates, only insertions and deletions were counted. Mismatch
errors, which account for the vast majority of errors after base 150, were included
only in the total error count. Note that project B has a similar or
slightly higher total error rate compared to project A but only about
one-third as many insertions and deletions up to base 500. For both projects, the
frameshift error rate in the raw data is <1 in 1000 for >300 bases, and ≤1 in
10,000 for >100 bases in project B.



therefore contains a longer unalignable "tail" of 
very low-quality bases. When calculating the aver- 
age error rate for these two sequences, the second 
sequence will have a much higher average error and, 
therefore, appear to be of lower quality. In contrast, 
the counts of very high-quality bases for both se- 
quences will be very similar, as the unalignable tails 
contain few, if any, high-quality bases. Therefore, 
counts of bases above a high enough quality thresh- 
old will give a more robust and clearer picture of 
trace quality. 
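
The example in the preceding paragraph can be made concrete with a short Python sketch. The two reads below are invented toy data, and the function names are hypothetical; the point is only to contrast the average predicted error with the count of very high-quality bases.

```python
def mean_predicted_error(quals):
    """Average error probability over a read, using p = 10^(-q/10)."""
    return sum(10 ** (-q / 10) for q in quals) / len(quals)

def very_high_quality_count(quals, threshold=30):
    """Number of bases with a quality score of at least the threshold."""
    return sum(1 for q in quals if q >= threshold)

# Same 400-base good region; the second read was run longer and
# carries a much longer tail of quality-4 bases.
read_short = [40] * 400 + [4] * 50
read_long = [40] * 400 + [4] * 450

if __name__ == "__main__":
    for name, read in (("short tail", read_short), ("long tail", read_long)):
        print(name, f"{mean_predicted_error(read):.3f}", very_high_quality_count(read))
```

The average predicted error is several times higher for the read with the long tail, whereas both reads have the same count of very high-quality bases.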

Frameshift Error Rates for Different Sequencing 
Chemistries 

Depending on how biologists use DNA sequences, 
knowledge about total error rates in raw sequences 
may or may not be sufficient. For example, frame- 
shift errors in coding sequences will generally lead 
to incorrectly predicted open reading frames,
whereas mismatch errors will do so only if the mis- 
match introduces a stop codon or a new splice site. 
At the time of this writing, PHRED did not differen- 
tiate between mismatch and frameshift errors, but 
only estimated total error rates. This might occa- 



sionally lead to questionable 
conclusions, as the results 
shown in Figure 3 illustrate.

Figure 3 shows the total 
actual error rates and the 
frameshift error rates for two 
projects, A and B. The total er- 
ror rates for both projects are 
similar for up to 350 bases; af- 
ter 350 bases, project B has a 
somewhat higher total error 
rate. However, examining the 
frameshift error rate gives rise 
to a different picture: from 
base 1 to 500, project A has 
approximately four times as 
many insertions and dele- 
tions as project B. This differ- 
ence in frameshift error rates 
can be explained by the se- 
quencing chemistries that 
were used in the two projects. 
Project B, with the lower 
frameshift error rate, used 
only dye terminator chemis- 
try, which is known to elimi- 
nate band spacing artifacts 
from hairpin structures ("com- 
pressions"). Project A, on the 
other hand, used dye primer chemistry, which is 
more prone to insertion and deletion errors from 
mobility artifacts, for most sequencing reactions. 
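
The distinction between frameshift errors and mismatches can be sketched in Python for a gapped pairwise alignment. This is an illustration only: the alignment representation (two equal-length strings with "-" as the gap character) and the function name are assumptions, not the output format of any particular alignment program.

```python
def classify_errors(read_aln, ref_aln, gap="-"):
    """Count mismatches, insertions, and deletions in a gapped alignment.

    read_aln and ref_aln are equal-length strings; a gap in the reference
    marks an extra base in the read (insertion), and a gap in the read
    marks a missing base (deletion).
    """
    if len(read_aln) != len(ref_aln):
        raise ValueError("aligned strings must have equal length")
    counts = {"mismatch": 0, "insertion": 0, "deletion": 0}
    for r, c in zip(read_aln, ref_aln):
        if r == c:
            continue
        if c == gap:
            counts["insertion"] += 1
        elif r == gap:
            counts["deletion"] += 1
        else:
            counts["mismatch"] += 1
    # Only insertions and deletions shift the reading frame.
    counts["frameshift"] = counts["insertion"] + counts["deletion"]
    return counts

if __name__ == "__main__":
    print(classify_errors("ACGT-ACGTTA", "ACCTGACG-TA"))
```

Reporting the frameshift count separately from the total makes differences between sequencing chemistries visible even when total error rates are similar.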

DISCUSSION 

As large-scale DNA sequencing has become a more 
routine and common process, the traditional meth- 
ods for assessing sequence quality have become un- 
satisfactory. In projects like single-pass cDNA se- 
quencing, it is not possible to calculate and compare 
error rates after finishing a sequence, as finishing 
never takes place. Even when a comparison between 
raw and finished sequence can be done, the time 
delay between raw data generation and quality as- 
sessment is often large. This delay makes it difficult 
to improve ongoing projects, and it sometimes 
makes it impossible to capture problems early on. 
Some immediate quality feedback can be reached by 
including known standard sequences for quality 
control. However, this approach can be costly, and 
it fails when error profiles differ between standard 
and unknown sequences. 

In contrast to these traditional methods to as- 
sess sequence accuracy, direct estimation of error 
rates in raw sequence data would enable immediate
quality control and feedback. Accurate, base-by- 
base estimates of error probabilities could also in- 
crease the utility of single-pass sequences signifi- 
cantly, allow efficient comparison and optimization 
of different sequence chemistries, and enable the 
development of better software tools for sequence 
assembly and analysis. 

The critical question for any error rate predic- 
tion tool is how accurate are the error rate estimates, 
in particular if different sequencing methods and 
chemistries are used? The results presented herein 
provide an answer to this question for the program 
PHRED, as well as clues where further development 
would be useful. As shown in Tables 3 and 4 and in 
Figure 1, the agreement between predicted and actual
error rates was very good in each of the six
different projects analyzed. The observed high level 
of prediction accuracy in all of these projects is al- 
most astonishing if one takes into account that ac- 
tual errors are binary (a base is either correct or 
wrong), whereas predicted error rates are probabilities
on a scale from 0.0 to 1.0. The observed tendency
to overpredict error rates can be at least partially
explained by the "small sample correction"
that was used in the derivation of threshold parameters
for quality scores (Ewing and Green 1998). For
most practical applications, such a somewhat con- 
servative estimation of quality scores is tolerable or 
even desirable. Overall, the results clearly show that 
error probabilities given by PHRED accurately de- 
scribe raw sequence data quality. 

In judging the usefulness of predicted error 
probabilities, it is important to know how differences
in sequencing methods will influence the prediction
accuracy. For example, the variation
in peak heights tends to be larger in dye terminator
sequencing than in dye primer sequencing, and dif- 
ferent sequencing enzymes are known to produce 
different specific height variation patterns. Any es- 
timation of error probabilities that takes the pecu- 
liarities of a specific sequencing chemistry into ac- 
count would therefore be expected to be less accu- 
rate for different chemistries. 

The projects included in this study were specifi- 
cally chosen to provide an initial answer to the 
question of how generally useful PHRED quality 
scores are. These projects represent the vast majority 
of different multicolor fluorescent sequencing 
methods used in the last 3 years: different template 
DNAs and DNA preparation methods, different en- 
zymes, gel lengths, run conditions, and different 
fluorescent dyes. The data also include a consider- 
able spread in data quality, both between projects 



and within individual projects. None of the projects 
analyzed here were included in PHRED's training 
set, and just one of the six laboratories that contrib- 
uted data to this study also contributed data to the 
training data sets. One of the projects in this study 
consisted entirely of dye terminator sequences, 
which represented only a small fraction of the
sequences in the test data set. Another project
exclusively used a set of fluorescent dyes different from
those used in the training sets. Each project differed 
from the other projects in this study in at least one, 
and typically many, experimental aspects like tem- 
plate preparation, sequencing enzymes, gel run con- 
ditions, and so forth. Despite these differences, the 
accuracy of error rate predictions was very similar 
for all projects. 

Our results justify some optimism about the ac- 
curacy of PHRED quality scores for minor changes 
in sequencing technology, for example, sequences 
generated by new enzymes and fluorescent dyes. 
Initial studies showed that PHRED quality scores 
were also accurate for sequences produced by mul- 
tiplex sequencing with radioactive detection (P. 
Richterich, unpubl.). However, we also observed 
two effects that can invalidate PHRED quality scores 
during these studies. First, sequences generated by 
chemical sequencing gave quality scores that were too low at
mixed (A + G) reactions. Because secondary peak 
height is one of the parameters used in the error rate 
predictions, this is not surprising. Another potential 
source of error is high-frequency noise in the trace 
data. With such data, PHRED occasionally underes- 
timated the band spacing by a factor of 2 or more, 
which resulted in incorrect base calls and quality 
scores. By applying simple smoothing algorithms to 
data with high-frequency noise, these problems 
could typically be resolved. Similar steps may be 
necessary to obtain accurate PHRED quality scores 
on data that have been generated by different se- 
quencing instruments or preprocessed by different 
software. 

Accurate quality scores can have a major impact 
on how sequences are used downstream from the 
sequence production process. In traditional se- 
quencing projects where the goal is complete cov- 
erage at a final error rate below (e.g.) 1 in 10,000, the 
accuracy goals can be reached with single sequence 
reads as long as the quality scores are at least 40 
(however, other potential problems like clone insta- 
bility may make higher coverage advisable). Inter- 
esting questions arise as to how individual read 
quality contributes to project quality, or the error 
rate of the "final" sequence. Under the assumption 
that errors between different sequence reads are 
completely independent, one could argue that two
reads with a quality score of 20 (error probability of 
1 in 100) are just as valuable as one sequence with a 
quality score of 40 (error probability of 1 in 10,000). 
However, although a single sequence stretch with 
quality levels above 40 would give a final sequence 
with an error rate of <1 in 10,000, assembling a con- 
sensus from two sequences with quality scores of 20 
(1% error rate) could lead to one of two results: If 
the errors were completely random, the consensus 
sequence would be ambiguous at 2% of all loca- 
tions; if the errors were completely localized, for ex- 
ample, because of reproducible compressions, the 
consensus sequence would have one "hidden" error 
every 100 bases. Typically, consensus sequences
derived from low-quality sequences will have both
kinds of problematic regions. Increased coverage 
can rapidly eliminate the random errors; however, 
increased coverage does not resolve errors from sys- 
tematic sources. Manual examination of such prob- 
lem areas is generally required; such "contig edit- 
ing," however, tends to be time consuming, re- 
quires highly trained personnel, is an obstacle 
toward complete automation of DNA sequencing, 
and sometimes fails to eliminate all errors. This 
leads to the somewhat counterintuitive conclusion 
that the practical value of increasing sequence qual- 
ity can be even higher than indicated by the quality 
scores: One sequence of average quality above 40 
can be "worth" more than two sequences of average 
quality 20. 
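
The arithmetic behind this comparison can be sketched as follows. This is a minimal illustration that assumes the usual relation between a quality score q and an error probability, 10^(-q/10); the function names are hypothetical and are not part of PHRED or PHRAP.

```python
def error_probability(q):
    """Error probability implied by a quality score q (assumed 10^(-q/10))."""
    return 10 ** (-q / 10)


def both_wrong(p):
    """Chance that two reads with independent errors are wrong at the same base."""
    return p * p


def at_least_one_wrong(p):
    """Chance that at least one of two independent reads is wrong at a base.

    With fully random errors this approximates the fraction of positions
    where the two reads disagree, i.e. where the consensus is ambiguous.
    """
    return 1 - (1 - p) ** 2


def hidden_error_rate_localized(p):
    """If errors are fully localized, both reads share the same wrong bases,
    so the consensus carries undetected errors at the single-read rate."""
    return p


p20 = error_probability(20)   # 1 in 100
p40 = error_probability(40)   # 1 in 10,000

print("both reads wrong (independent):", both_wrong(p20))
print("ambiguous positions (random errors):", at_least_one_wrong(p20))
print("hidden errors (localized errors):", hidden_error_rate_localized(p20))
```

Under the independence assumption the product of the two error probabilities matches the quality-40 value, which is the "just as valuable" argument; the other two quantities correspond to the ambiguous and hidden-error cases described in the text.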

Another application of DNA sequencing where 
high quality can be of disproportionately high value 
is the search for mutations in genomic DNA. In low 
quality sequences, secondary peaks and low resolu- 
tion often complicate the identification of hetero- 
zygous mutations. In regions of higher sequence 
quality, such secondary peaks are smaller or absent 
and peaks are better resolved. Therefore, both false- 
positive and false-negative errors can be signifi- 
cantly reduced in high-quality regions. Tools like 
PHRED, which can accurately measure sequence 
quality from trace data, can be of twofold value for 
mutation detection. First, base-specific quality 
scores can allow optimization of sequencing meth- 
ods and strategies for mutation detection. Second, 
the quality scores can be used to evaluate the use- 
fulness of individual sequence reads for mutation 
detection (e.g., by discarding reads below minimum 
thresholds), and they can guide software that
automatically detects mutations.

The ability to predict error rates in a highly ac- 
curate fashion is likely to have a major impact in 
applications like those described above. PHRED is
the first widely used program that accurately
predicts base-specific error probabilities. However, the
algorithm for determining quality values has been 
described (Ewing and Green 1998), and it should be 
straightforward to implement similar quality values 
in other base-calling programs. Furthermore, an ex- 
tension of the approach developed by Ewing and 
Green should be possible. For example, differentia- 
tion between mismatch and frameshift errors would 
enable better comparisons of sequencing methods 
with similar total error rates but different frameshift 
error rates. Several groups have described efforts to 
calculate separate probabilities (or "confidence as- 
sessments") for mismatch errors and frameshift er- 
rors (Lawrence and Solovyev 1994; Berno 1996). 
Their results demonstrated that different ap- 
proaches to error type characterization are feasible 
and promising. Implementation of such error type 
predictions in other programs similar to the way 
PHRED uses quality scores would enable better 
method assessments, benchmarking, and production 
quality control, and could have a significant impact 
on downstream uses of DNA sequence information. 

METHODS 
Data Sets 

For one project, sequence raw data in the form of 
ABI trace files were downloaded from a public FTP 
site. Sequence data for the five other projects were 
kindly provided by five different large-scale se- 
quencing groups. Table 1 gives a summary of the six 
projects, and Table 2 gives an overview of the dif- 
ferent sequencing methods used in the projects. The 
projects differed in the amount of prescreening of 
data that had been done, reflecting different ap- 
proaches to quality control in different laboratories. 
In two projects (B and C), different software pro- 
grams had been used to identify and eliminate low- 
quality sequences. One project (F) included all data 
files generated, whereas the other three projects had 
excluded "failed lanes." 

Comparison of Actual and Predicted Error Rates 

The sequences for all traces in each project were 
recalled using the program PHRED (v. 961028). 
Next, sequences in each project were assembled 
with PHRAP (P. Green, unpubl.). Slightly different 
methods were chosen for the statistical and graphi- 
cal evaluation of the error rate prediction accuracy. 
In the statistical evaluation, only the longest contig 
produced by PHRAP was considered. The tables of 
aligned bases and observed discrepancy counts for 
each quality score were taken from the PHRAP
output and analyzed as follows. The expected number
of discrepancies ($E$) at each quality score ($q$) was
calculated by multiplying the number of aligned bases
($N$) with the error probability corresponding to the
quality score: $E = N \times 10^{-0.1q}$. The Spearman ranking
coefficients were calculated by comparing the
expected and observed error frequencies. To obtain
the quantitative relation between the expected and 
observed error rates over the entire range, a least- 
squares fit between the observed and expected rates 
was performed, with the intercept set to zero and 
the number of aligned bases at each quality score 
used as weights. 
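
The expected-discrepancy calculation and the zero-intercept weighted fit can be sketched as below. This is an illustration only: the table values are invented, the function names are hypothetical, and the direction of the fit (observed rate regressed on expected rate) is an assumption, since the text does not state it.

```python
def error_probability(q):
    """Error probability for quality score q: 10^(-0.1 q)."""
    return 10 ** (-0.1 * q)


def expected_discrepancies(aligned_bases, q):
    """Expected discrepancies E = N * 10^(-0.1 q) at quality score q."""
    return aligned_bases * error_probability(q)


def weighted_slope_through_origin(expected_rates, observed_rates, weights):
    """Weighted least-squares slope b for observed = b * expected.

    The intercept is fixed at zero; each point is weighted by the
    number of aligned bases at that quality score.
    """
    num = sum(w * x * y for x, y, w in zip(expected_rates, observed_rates, weights))
    den = sum(w * x * x for x, w in zip(expected_rates, weights))
    return num / den


# Hypothetical table: quality score -> (aligned bases N, observed discrepancies)
table = {10: (5000, 480), 20: (20000, 190), 30: (60000, 55)}

scores = sorted(table)
weights = [table[q][0] for q in scores]
expected_rates = [error_probability(q) for q in scores]
observed_rates = [table[q][1] / table[q][0] for q in scores]

slope = weighted_slope_through_origin(expected_rates, observed_rates, weights)
print("observed/expected slope:", slope)
```

A slope below 1 in such a fit would correspond to overprediction of error rates, the tendency noted in the Discussion.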

For a graphical comparison of estimated and ac- 
tual error rates in 50-bp windows, the following 
steps were taken. For two of the projects, the con- 
sensus sequence was retrieved from public data- 
bases. For the four other projects, the DNA sequence 
and quality information were used by the program 
PHRAP to assemble consensus sequences for each of 
the projects. The individual reads were aligned to 
the consensus sequences of the longest contig, us- 
ing the program CROSS_MATCH (P. Green, un- 
publ.), after removing single-coverage regions from 
the ends of the consensus sequence. CROSS_MATCH
uses an implementation of the Smith-Waterman
algorithm to generate alignments that
typically do not include the ends of sequences, 
where disagreements are commonly due to vector 
sequence or low quality sequence. 

The quality files generated by PHRED and the 
alignment summaries generated by CROSS_MATCH
were then analyzed as follows. First, the
region of each query sequence that had been aligned
by CROSS_MATCH was determined. Next, the actual
and predicted error rates for the entire aligned part of
each individual sequence were calculated. In
addition, the average actual and predicted error rates for
all alignable sequences together were calculated for 
windows of 50 bases in length. To calculate the pre- 
dicted error rate, the quality scores q determined by 
PHRED at each base were converted to error prob- 
abilities as described above (Ewing and Green 1998). 
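
The window calculation can be sketched for a single aligned read as follows. This is a simplified illustration: it assumes the per-base quality scores and a per-base error flag are already available as lists (the real PHRED and CROSS_MATCH file formats are not parsed here), and the function name is hypothetical. The study averaged over all alignable sequences; this sketch handles one read.

```python
def window_error_rates(quals, is_error, window=50):
    """Return (predicted, actual) error rates for consecutive windows.

    quals    -- quality scores for the aligned part of one read
    is_error -- booleans, True where the base disagrees with the consensus
    """
    if len(quals) != len(is_error):
        raise ValueError("quals and is_error must have the same length")
    rates = []
    for start in range(0, len(quals), window):
        q = quals[start:start + window]
        e = is_error[start:start + window]
        # Predicted rate: mean of the per-base error probabilities.
        predicted = sum(10 ** (-0.1 * x) for x in q) / len(q)
        # Actual rate: fraction of bases flagged as discrepancies.
        actual = sum(e) / len(e)
        rates.append((predicted, actual))
    return rates


# Hypothetical read: 100 bases of quality 20 with one miscalled base.
example = window_error_rates([20] * 100, [True] + [False] * 99)
for predicted, actual in example:
    print(predicted, actual)
```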

Subdividing Data into Subsets Based on Data Quality 

To examine the accuracy of PHRED quality scores 
for data subsets of different quality within a project, 
the following approach was taken. For all sequence 
reads in project B, the number of bases with a qual- 
ity score of at least 30 in each sequence was deter- 
mined (bases with quality scores of at least 30 were 
called very high-quality bases, or VHQ bases).
Sequences were sorted in descending order based on
the number of very high-quality bases, and divided 
into four quartiles. Accordingly, quartile 1 con- 
tained 25% of sequences with the highest number 
of very high-quality bases, and quartile 4 contained 
the "worst" sequences. To illustrate the prediction 
accuracy in data with relatively high error rates, se- 
quences from project B that had been "discarded" 
because they had not met the minimum quality cri- 
teria were added back to the data set. The sequences 
in each quartile were compared to the consensus se- 
quences that had been generated using the entire data 
set, as described above for the graphical comparison. 
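
The quartile assignment can be sketched as follows. The read names and the dictionary layout are hypothetical, and how reads with equal VHQ counts were ordered is not stated in the text; here ties keep their input order.

```python
def vhq_count(quals, threshold=30):
    """Number of very high-quality (VHQ) bases: quality score >= threshold."""
    return sum(1 for q in quals if q >= threshold)


def split_into_quartiles(reads):
    """Sort reads by descending VHQ count and cut the list into four parts.

    reads -- dict mapping a read name to its list of quality scores
    Returns four lists of read names; the first holds the best reads.
    """
    ordered = sorted(reads, key=lambda name: vhq_count(reads[name]), reverse=True)
    n = len(ordered)
    bounds = [round(i * n / 4) for i in range(5)]
    return [ordered[bounds[i]:bounds[i + 1]] for i in range(4)]


# Hypothetical data: read "r<i>" has i bases of quality 40 and the rest quality 10.
reads = {f"r{i}": [40] * i + [10] * (8 - i) for i in range(8)}
quartiles = split_into_quartiles(reads)
print(quartiles)
```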

Determining Actual Frameshift Error Rates 

The calculation of actual frameshift error rates in 
the raw sequence data was performed using
CROSS_MATCH, similar to the procedure described above
for total error rates, except that only insertion and 
deletion errors were counted. Because PHRED does 
not give separate frameshift error estimates, a com- 
parison of predicted and actual frameshift errors is 
not possible. 
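
Counting only insertion and deletion errors can be sketched as below. This illustration assumes the alignment is available as two gapped strings of equal length, which is not the actual CROSS_MATCH output format, and it assumes the rate is expressed per aligned read base; the function name is hypothetical.

```python
def frameshift_error_rate(aligned_read, aligned_consensus, gap="-"):
    """Insertions plus deletions per read base in a gapped pairwise alignment.

    Mismatches (substitutions) are ignored; only columns where exactly one
    of the two sequences has a gap are counted.
    """
    if len(aligned_read) != len(aligned_consensus):
        raise ValueError("aligned strings must have the same length")
    indels = 0
    read_bases = 0
    for r, c in zip(aligned_read, aligned_consensus):
        if r != gap:
            read_bases += 1
        if (r == gap) != (c == gap):
            indels += 1
    return indels / read_bases


# Hypothetical alignment with one deletion and one insertion in the read.
rate = frameshift_error_rate("ACG-TACGT", "ACGTTA-GT")
print(rate)
```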
This unit contains a general discussion of
factors that should be considered before em- 
barking on a DNA sequencing project. In gen- 
eral, any sequencing strategy should include 
plans for sequencing both strands of the DNA 
fragment. Complementary strand confirmation 
leads to higher accuracy, especially when se- 
quencing regions where artifacts such as "com- 
pressions" are a problem. Sequencing the op- 
posite strand is often required to obtain accurate 
data for such regions. 

The most commonly used methods for gen- 
erating appropriately sized DNA fragments for 
dideoxy and chemical sequencing are dis- 
cussed below. The biochemistry underlying 
these procedures, as well as how to choose 
between these and alternative sequencing 
methods, are discussed in the introduction to 
this chapter. 

DIDEOXY SEQUENCING 

Planning for Dideoxy Sequencing 

Sequence determination of a fragment of
<500 nucleotides is usually straightforward be- 
cause this amount of sequence information can 
reliably be determined from a single set of 
sequencing reactions. Fragments of this size 
can usually be subcloned directly into an ap- 
propriate single- or double-stranded DNA se- 
quencing vector. If the vector generates single- 
stranded DNA, such as the M13mp vectors 
described below, the fragment should be cloned 
in both orientations so that both strands of the 
insert are produced as single-stranded DNA. A
primer that hybridizes to a site on the vector 
adjacent to the insert DNA is then used to 
sequence both clones, generating the sequence 
of each strand. When sequencing double- 
stranded plasmid DNA, there are two options 
for obtaining the sequence of each strand. A
single primer can be used if the insert DNA is 
cloned in both orientations. Alternatively, two 
primers that hybridize to plasmid sequences on 
opposite sides (and opposite strands) of the 
insert DNA can be used to sequence a single 
clone. 

To sequence larger regions of DNA com- 
pletely, it is generally necessary to subdivide a 
large fragment into smaller ones that can then 
be individually sequenced. Three general
approaches are currently used. In the first
approach, known as "shotgun cloning," random
fragments are created from longer DNA
fragments by physical shear, digestion by
nucleases (e.g., DNase I), or by restriction digests
with endonucleases that make frequent cuts in
the fragment (e.g., those with four-base
recognition specificity; Frischauf et al., 1980;
Anderson, 1981; Bankier and Barrell, 1983; Bankier
et al., 1988; Hong, 1982; Messing, 1983, 1988;
Deininger, 1983a, 1983b; Bankier, 1984; Lin
et al., 1985). These fragments are combined and
the entire pool is ligated into the appropriate
sequencing vector. After the DNA sequence of
the various cloned fragments has been deter- 
mined, the final sequence is compiled by com- 
puter using overlapping information from the 
individual fragments (unit 7.7). However, with 
more complex (i.e., longer) sequences, this ap- 
proach becomes tedious since it requires puri- 
fying, ligating, and cloning numerous individ- 
ual fragments. In addition, finding the final few 
percent of a sequence by this procedure can 
consume a disproportionately large amount of 
time. 
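
The idea of compiling a final sequence from overlapping fragments can be illustrated with a small greedy merge. This is a toy sketch under strong assumptions (error-free reads, a single strand, no repeats) and is not the method of any particular assembly program; the function names are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no remaining overlaps; leave separate contigs
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return frags


# Hypothetical fragments from a 12-base sequence.
contigs = greedy_assemble(["ATGCGT", "CGTTAG", "TAGGCA"])
print(contigs)
```

Gaps in coverage leave more than one contig in the result, which is the situation behind the remark that the last few percent of a sequence can take a disproportionate amount of time.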

A second subcloning strategy for sequenc- 
ing large DNA fragments is to generate an 
ordered set of subclones from a large DNA 
molecule. This is usually done by making pro- 
gressive (nested) sets of deletions from a clone 
containing the entire DNA fragment to be se- 
quenced. Numerous protocols exist for making 
nested deletions by enzymatic means; two such 
protocols using exonuclease III (Henikoff, 
1984; Guo and Wu, 1982; Okita, 1985; Ozkay- 
nak and Putney, 1987; Smith, 1979, 1980; 
Strauss and Zagurski, 1991) and nuclease Bal 
31 (Guo and Wu, 1982; Guo et al., 1983; Vocke
and Bastia, 1983; Yanisch-Perron et al., 1985;
Poncz et al., 1982; Misra, 1985) are presented 
in unit 7.2. Another enzymatic method for mak- 
ing nested deletions utilizes T4 DNA polym- 
erase (Dale et al., 1985). Other methods for 
isolating nested sets of deletion fragments in- 
clude size-selection of inserts (Barnes, 1987; 
Barnes and Bevan, 1983; Barnes et al., 1983; 
Vocke and Bastia, 1983), oriented restriction 
fragment subcloning (Yanisch-Perron et al., 
1985; Lee and Lee, 1989; Hoheisel and Pohl, 
1986), transposon-mediated deletions
(Ahmed, 1987a, 1987b; Nag et al., 1988; Peng and
Wu, 1986), progressive synthesis (Burton et al.,
1988; Liu and Hackett, 1989), and
oligonucleotide-directed mutagenesis (Shen and Waye,
1988; Wang et al., 1989). Commercial kits for
to search sequences for regions that are complementary to synthetic
oligonucleotides. 

Accuracy of the sequence 

When DNA sequencing is carried out carefully, the error rate is less than 
0.1%. However, to achieve this high rate of accuracy, it is necessary to 
sequence both strands of the target DNA completely and to resolve all 
ambiguities and discrepancies. In this respect, random sequencing has an 
advantage, since the gradual accumulation of redundant primary sequences 
greatly improves the accuracy of the final assembled sequence. However, 
there may be regions of the target DNA that cannot be sequenced accurately 
by either the random method or directed methods. Resolving these difficult 
sequences often takes a surprisingly long time and sometimes requires the 
use of base analogs (to eliminate compressions) or Maxam-Gilbert se- 
quencing. 

Future direction of the project 

Different sequencing strategies yield different types of material that can be 
used in later experiments. For example, nested sets of deletions generated 
for DNA sequencing can be used to study the domains within a promoter 
region or sets of oligonucleotides complementary to different regions of the 
target fragment can be used to sequence mutant forms of the target sequence. 
Random subclones created for shotgun sequencing provide a store of material 
that can subsequently be used for site-directed mutagenesis or for the 
generation of radiolabeled probes. 
Summary

The complete 9193-nucleotide sequence of the
probable causative agent of AIDS,
lymphadenopathy-associated virus (LAV), has been determined. The deduced
genetic structure is unique: it shows, in addition to the 
retroviral gag, pol, and env genes, two novel open 
reading frames we call Q and F. Remarkably, Q is lo- 
cated between pol and env and F is half-encoded by 
the U3 element of the LTR. These data place LAV apart 
from the previously characterized family of human 
T cell leukemia/lymphoma viruses.

Introduction 

The recent onset of severe opportunistic infections among 
previously healthy male homosexuals has led to the
characterization of the acquired immune deficiency syndrome
(AIDS) (Gottlieb et al., 1981; Masur et al., 1981). The
disease has spread dramatically, and new high-risk groups
have been identified: patients receiving blood products, 
intravenous drug addicts, and individuals originating from 
Haiti and Central Africa (Piot et al., 1984). AIDS is a fatal 
disease, and there is at present no specific treatment. The 
causative agent was suspected to be of viral origin since 
the epidemiological pattern of AIDS was consistent with 
a transmissible disease, and cases had been reported af- 
ter treatment involving ultrafiltered anti-hemophilia prepa- 
rations (Daly and Scott, 1983). A decisive step in AIDS re- 
search was the discovery of a novel human retrovirus 
called lymphadenopathy-associated virus (LAV)
(Barré-Sinoussi et al., 1983). The properties of the virus
consistent with its etiological role in AIDS are: the recovery of
many independent isolates from patients with AIDS or
related diseases (Montagnier et al., 1984); high LAV
seropositivity among these populations (Brun-Vézinet et
al., 1984); a tropism and cytopathic effect in vitro for the
helper/inducer T-lymphocyte subset T4 (Klatzmann et al.,
1984), also found depleted in vivo.

Other groups have reported the isolation of human 
retroviruses, the human T cell
leukemia/lymphoma/lymphotropic virus type III (HTLV-III) (Popovic et al., 1984) and
the AIDS-associated retrovirus (ARV), which display
biological and sero-epidemiological properties very similar to
if not identical with those of LAV (Levy et al., 1984; Popovic
et al., 1984; Schupbach et al., 1984). Both LAV and HTLV-III
genomes have been molecularly cloned (Alizon et al.,
1984; Hahn et al., 1984). Their restriction maps show
remarkable agreement, including a Hind III restriction site 
polymorphism, bearing in mind the variability of this virus 
(Shaw et al., 1984) and confirming that these two viruses 
represent a single viral lineage. 

In addition to its obvious diagnostic and therapeutic 
potential, the LAV DNA nucleotide sequence is essential 
to an understanding of the genetics and molecular biology 
of the virus and its classification among retroviruses. We 
report here the complete 9193-nucleotide sequence of the 
LAV genome established from cloned proviral DNA. 

Results 

DNA Sequence and Organization of the LAV Genome 
We have reported previously the molecular cloning of both 
cDNA and integrated proviral forms of LAV (Alizon et al.,
1984). The recombinant phage clones were isolated from
a genomic library of LAV-infected human T-lymphocyte
DNA partially digested by Hind III. The insert of
recombinant phage λJ19 was generated by Hind III cleavage
within the R element of the long terminal repeat (LTR).
Thus each extremity of the insert contains one part of the
LTR. We have eliminated the possibility of clustered Hind
III sites within R by sequencing part of an LAV cDNA
clone, pLAV 75 (Alizon et al., 1984), corresponding to this
region (data not shown). Thus the total sequence
information of the LAV genome can be derived from the λJ19
clone.

Using the M13 shotgun cloning and dideoxy chain
termination method (Sanger et al., 1977), we have
determined the nucleotide sequence of the λJ19 insert. The
reconstructed viral genome with two copies of the R sequence
is 9193 nucleotides long. The numbering system starts at
the cap site (see below) of virion RNA (Figure 1).

The viral (+) strand contains the statutory retroviral 
genes encoding the core structural proteins (gag), reverse 
transcriptase (pol), and envelope protein (env), and two 
extra open reading frames (orf) that we call Q and F (Table 
1). The genetic organization of LAV, 5'LTR-gag-pol-Q-env- 
F-3'LTR, is unique. Whereas in all replication-competent 
retroviruses pol and env genes overlap, in LAV they are 
separated by orf Q (192 amino acids) followed by four
small (<100 triplets) orf. The orf F (206 amino acids) 
slightly overlaps the 3' end of env and is remarkable in that 
it is half-encoded by the U3 region of the LTR. 

Such a structure clearly places LAV apart from previ- 
ously sequenced retroviruses (Figure 2). The (-) strand is 
apparently noncoding. The additional Hind III site of the
LAV clone λJ81 (with respect to λJ19) maps to the
apparently noncoding region between Q and env (positions
5166-5745). Starting at position 5501 is a sequence
(AAGCCT) that differs by a single base (underlined) from
the Hind III recognition sequence. It is anticipated that
many of the restriction site polymorphisms between differ- 
ent isolates will map to this region. 
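
The search for sequences that differ from a restriction site by a single base can be sketched as follows. The Hind III recognition sequence AAGCTT is standard; the test sequence and the function name are hypothetical and do not reproduce the LAV sequence.

```python
HINDIII_SITE = "AAGCTT"


def near_sites(sequence, site=HINDIII_SITE, max_mismatches=1):
    """List windows of the sequence within max_mismatches of the site.

    Returns tuples of (1-based start position, window, mismatch count).
    Exact sites are reported with a mismatch count of 0.
    """
    hits = []
    for i in range(len(sequence) - len(site) + 1):
        window = sequence[i:i + len(site)]
        mismatches = sum(a != b for a, b in zip(window, site))
        if mismatches <= max_mismatches:
            hits.append((i + 1, window, mismatches))
    return hits


# Hypothetical stretch containing AAGCCT, one base away from AAGCTT.
hits = near_sites("GGAAGCCTGG")
print(hits)
```

Positions reported with one mismatch are candidates for restriction site polymorphisms, since a single base change would create a recognition site.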
[Figure 1. Nucleotide sequence of the LAV genome with the deduced amino acid sequences of the open reading frames (labels for gag, pol, and orf Q are legible); the sequence itself is not recoverable from the scanned copy. Running title: "Nucleotide Sequence of the AIDS Virus, LAV."]
CC^mCC^OTCCTCTCCXAAACTCArTTCacUCTCCTCTCCCTTC^ 

. . . 7600 . . , . . . - . 

GWtl*At^* oT r r ^ rS «r^ulltltiiS*rUuXWGluCluS«rClaA«nClaCla^ 

C AAATT AACAATTACACAACCTT AAT ACATTC CTTAATTGAACAATC C CAAAAC CA CCAA Ct A A AAi^ULTCAACAACAATTATTCCAATTACATAAA 

. 7700 . ....... 7800 

AioIWrfcrA^oTrpUuTrpTyrlULyiUtPhtntM^tU^ValCljClyU^ 

AACATAACAAATTC CCTCTCCTATAtAAAAATATTCATAATCATACTAG CAjCCCrrCCTACCTrrAACAATACTTTTTCCTCTACTTTCT AtACTCAATACAGTTAC CCACCGATATTCA 

7900 . 

PraL«uS«rPh«CUTTirUitL«uProThrfroAjr|ClyPToAipAT|ProCl uGlyll«CluGluCluCl7GlyClaAxjA«pAx gAjpArf S«rXltArgL«uV«LA«aCl]r$trUu 
CCaTTATCCTTTCACACCCACCTCCauCCCC&tfCC^ 

... . . . 8000 .... 

Al*U\*IWTrpA*pAfpL«uAr|S«rUuC7<Uufh€S«rTyrHi#AxjL«u^ 

GCACTTATCTCCGACGATCTGCCGAjGCCTCTCCCTCTTCACCfACCAC CCCTTCACACACTTA C 1 I I IC ATTGTAACCACCATTCTC CAACrTCTGCCACCCACGCCGTC CCAAGG CCTC 

8100 ..... 

Ly<TyrTrpTrpAiaUuUuClQTyrTrp$*rCUCluL«utyfA«nS«rAl«V^ 
AAAlATTCGTCCAATCTCCTACACTATTCGAGTCACGAACTAAACAATACtCCtGTt^ 

8200 ........ 

CUGiyAUCyiArtAl«Il«AriHi«IUfroArvArttl«ArtClnClyUviCluAx|Il«UttL*» • 

QRF F A»pAr(Ai*Tr P Ly»GlyPbtCyiTyTLy^3Cl7GlyLy»Tf pS«rLy»S«rS«rV*lT«tClyTrpfroThrV«l 
CAAGCACCTTCTAGAGCTATTCCCCACAtACCTACJUtfJ^ 

8300 . . . . • . . * **O0 

Ar»CluAr**fatAx»Ar»Ai«Cltt*roAaaAl4A*pfil7V«ICl7Al^«S«rA^ 
AAGGCAAACAATCAGACCACCTCACCCAC<UCaCATCCCCTCC<^CCAC<^ 4A A ACi TCCA/XAATCACAACTACCAATAJCACGACCtACCAATGCIGC 1 HI ICC 

8500 

TrpUuCluAUGlnGUCluCUGUT«lClyFhi>roT«lThgFTgClttT«l^ 
mmh r , k ^rA r. a A^^r.r^r^r.rmtCTrTrrr^^^ i TCACT T A fl AACGf ACCTCTACATCTTACCCACTTTT T 4AA iT.A A A AfiCO fl CC 

8600 .... 

UuGWClyUttlWHiiScrGlttArfAxgCloAjpIWUttAjpUuTrplU^ 
AWCAASC^AATTCAt^CCCAAOUAC^^ 

8700 . - . . . * 

U*Tbrrh«CiyTrpCy»T7rLy«UuValFroniCluProA*^ 
ACTGACCTTTCGATGGTCCIACAAGCTACTACCACTTCACI * UlC .A G A r .A A f A rCAGCTTCrrACACCCTCTCACCCTC^IGCAATCGATCACCC 

8800 ........ 

GluAriCluV*XUuCluTrpAxirb«A«pS«rAr»UuAi^«EiiBi«V«^ • 

TCACAWCAACTCTTACACTGCACCTTtGAa^ 

8900 . * ' . . . • . • 9° 00 

CCCTGG<XACTrTCCACGCACCCGTCGCCTCG«CCCACrCGCCACTCCCGAGGCCT 

. HiftdHI .... 9100 
CACCCrCCCAGCTCTCTCCCTAACTACCCAACCCACTCCTTAAGCC^rCAATA 

MM 

Figure 1. Complete DNA Sequence of Viral Genome (LAV1a)

The sequence was reconstructed from the sequence of phage λJ19 insert. The numbering starts at the cap site, which was located experimentally
(see above). Important genetic elements, major open reading frames, and their predicted products are indicated together with the Hind III cloning
sites. The potential glycosylation sites in the env gene are overlined. The NH2-terminal sequence of p25 determined by protein microsequencing
is boxed (Genetic Systems, personal communication).

Each nucleotide was sequenced on average 5.3 times: 85% of the sequence was determined on both strands and the remainder was sequenced
at least twice from independent clones. The base composition is T, 22.2%; C, 17.8%; A, 35.8%; G, 24.2%; G + C, 42%. The dinucleotide CpG
is greatly under-represented (0.9%) as is common among eukaryotic sequences (Bird, 1980).
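
The base composition and CpG figures quoted in this legend are simple counts over the sequence. The following is a minimal sketch in Python of how such figures can be computed; the function names and the toy sequence are illustrative assumptions, and the legend does not state exactly how the 0.9% CpG value was derived (here it is taken as the share of all overlapping dinucleotide positions).

```python
from collections import Counter

# Percentages as printed in the legend of Figure 1.
PRINTED = {"T": 22.2, "C": 17.8, "A": 35.8, "G": 24.2}


def base_composition(seq):
    """Return the percentage of each base, plus the combined G+C percentage."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    comp = {base: 100.0 * counts[base] / len(seq) for base in "TCAG"}
    comp["G+C"] = comp["G"] + comp["C"]
    return comp


def dinucleotide_percent(seq, dinuc="CG"):
    """Return the percentage of overlapping dinucleotide positions equal to dinuc."""
    seq = seq.upper()
    pairs = len(seq) - 1
    if pairs < 1:
        raise ValueError("sequence too short")
    hits = sum(1 for i in range(pairs) if seq[i:i + 2] == dinuc)
    return 100.0 * hits / pairs


toy = "ATGCGATTAAGC"  # invented sequence, not LAV data
print(base_composition(toy))
print(dinucleotide_percent(toy))

# Consistency check on the printed figures: the four bases should total 100%
# and G + C should come to 42%.
print(round(sum(PRINTED.values()), 1), round(PRINTED["G"] + PRINTED["C"], 1))
```

Applied to the printed percentages, the last line confirms that the four values total 100% and that G + C is 42%.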



The LTR 

The organization of a reconstructed LTR and viral flanking
elements is shown schematically in Figure 3. The LTR is
638 bp long and displays usual features (Chen and Barker,
1984): it is bounded by an inverted repeat (5'ACTG) including
the conserved TG dinucleotide (Temin, 1981); adjacent
to the 5' LTR is the tRNA primer binding site (PBS),
complementary to tRNALys (Raba et al., 1979); adjacent to the 3'
LTR is a perfect 15 bp polypurine tract. The other three
polypurine tracts observed between nucleotides
8200-8800 are not followed by a sequence that is
complementary to that just preceding the PBS.

The limits of the U5, R, and U3 elements were determined
as follows. U5 is located between the PBS and the polyadenylation
site established from the sequence of the 3' end of
oligo(dT)-primed LAV cDNA (Alizon et al., 1984). Thus U5
is 84 bp long. The length of R+U5 was determined by synthesizing
tRNA-primed LAV cDNA. After alkaline hydroly-



Table 1. Locations and Sizes of Viral Open Reading Frames

orf      1st Triplet     Met      Stop    No. Amino Acids    Mr Calc.

gag          312         336     1,836         500             55,841
pol        1,631       1,934     4,640       (1,003)         (113,629)
orf Q      4,554       4,587     5,163         192             22,487
env        5,746       5,767     8,350         861             97,376
orf F      8,324       8,354     8,972         206             23,318

The nucleotide coordinates refer to the first base of the first triplet (1st triplet), of the first methionine (initiation) codon (Met), and of the stop codon
(Stop). The numbers of amino acids and molecular weights are those calculated for unmodified precursor products starting at the first methionine
through to the end, with the exception of pol, where the size and Mr refer to that of the whole orf.
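
The residue counts in Table 1 follow directly from the coordinates: the distance from the first base of the starting codon to the first base of the stop codon, divided by three. The sketch below, in Python, checks this for the rows whose coordinates are legible in this copy; the dictionary layout and function name are illustrative assumptions. For pol the start is the first triplet, because the table counts the whole orf.

```python
# Coordinates as printed in Table 1 (first base of each codon).
ORFS = {
    "pol":   {"start": 1631, "stop": 4640, "residues": 1003},  # whole orf
    "orf Q": {"start": 4587, "stop": 5163, "residues": 192},
    "env":   {"start": 5767, "stop": 8350, "residues": 861},
    "orf F": {"start": 8354, "stop": 8972, "residues": 206},
}


def codons_before_stop(start, stop):
    """Number of codons from the start codon up to, but excluding, the stop codon."""
    span = stop - start
    if span < 0 or span % 3 != 0:
        raise ValueError("start and stop are not in the same reading frame")
    return span // 3


for name, orf in ORFS.items():
    n = codons_before_stop(orf["start"], orf["stop"])
    print(name, n, "matches table" if n == orf["residues"] else "differs from table")

# For gag, the text gives the initiation codon (336) and the length (500 residues),
# which together imply the position of the stop codon.
gag_stop = 336 + 3 * 500
print("gag stop codon implied at", gag_stop)
```

All four rows reproduce the printed residue counts, and the gag figures imply a stop codon at position 1836.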






Figure 2. Comparison of the Genome Organization of LAV with Those
of Human T Cell Leukemia/Lymphoma Virus Type I (HTLV-I) (Seiki et
al., 1983), Moloney Murine Leukemia Virus (MoMuLV) (Shinnick et al.,
1981), and Rous Sarcoma Virus (RSV) (Schwartz et al., 1983)
The positions and sizes of viral genes are drawn to scale (open boxes)
and the viral genomes (RNA forms) are delimited by brackets.

sis of the primer, R+U5 was found to be 181 ±1 bp (Figure 4).
Thus R is 97 bp long and the cap site at its 5' end
can be located. Finally, U3 is 456 bp long. The LAV LTR
also contains characteristic regulatory elements: a polyadenylation
signal sequence AATAAA 19 bp from the R-U5
junction, and the sequence ATATAAG, which is very likely
the TATA box, 22 bp 5' of the cap site. There are no long
direct repeats within the LTR. Interestingly, the LAV LTR
shows some similarities to that of the mouse mammary tumor
virus (MMTV) (Donehower et al., 1981). They both use
tRNALys as a primer for (-) strand synthesis, whereas all
other exogenous mammalian retroviruses known to date
use tRNAPro (Chen and Barker, 1984). They possess very
similar polypurine tracts; that of LAV is AAAAGAAAAGGGGGG
while that of MMTV is AAAAAAGAAAAAAGGGGG.
It is probable that the viral (+) strand synthesis is discontinuous
since the polypurine tract flanking the U3 element
of the 3' LTR is found exactly duplicated in the 3' end of orf
pol, at 4331-4346. In addition, MMTV and LAV are exceptional
in that the U3 element can encode an orf. In the
case of MMTV, U3 contains the whole orf while, in LAV, U3
contains 110 codons of the 3' half of orf F.
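
The element sizes quoted in this section can be cross-checked with a little arithmetic. The sketch below is only an illustration in Python (the variable and function names are assumptions, not part of the original work); it derives R from the strong-stop cDNA length and confirms that the LAV tract quoted above is a 15 bp run of purines.

```python
# Values taken from the text of this section.
R_PLUS_U5 = 181  # strong-stop cDNA length, stated error +/- 1 bp
U5 = 84
U3 = 456

R = R_PLUS_U5 - U5       # length of the R element
ltr_total = U3 + R + U5  # sum of the three LTR elements

LAV_PPT = "AAAAGAAAAGGGGGG"  # polypurine tract quoted for LAV


def is_polypurine(seq):
    """True if the sequence is non-empty and contains only the purines A and G."""
    return bool(seq) and set(seq.upper()) <= {"A", "G"}


print("R =", R)
print("U3 + R + U5 =", ltr_total)
print("LAV tract length =", len(LAV_PPT), "all purines:", is_polypurine(LAV_PPT))
```

With the figures given in the text the three elements sum to 637 bp, which sits within the stated ±1 bp error of the 638 bp LTR length.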

Viral Proteins 

gag 

Near the 5' extremity of the gag orf is a "typical" initiation
codon (Kozak, 1984) (position 336), which is not only the
first in the gag orf, but the first from the cap site. The
precursor protein is 500 amino acids long. The calculated
Mr of 55,841 agrees with the 55 kd gag precursor polypeptide
(Luc Montagnier, unpublished results). The N-terminal
amino acid sequence of the major core protein
p25, obtained by microsequencing (Genetic Systems, personal
communication), matches perfectly with the translated
nucleotide sequence starting from position 732 (see
Figure 1). This formally makes the link between the cloned
LAV genome and the immunologically characterized LAV
p25 protein. The protein encoded 5' of the p25 coding sequence
is rather hydrophilic. Its calculated Mr of 14,866 is
consistent with that of the gag protein p18. The 3' part of
the gag region probably codes for the retroviral nucleic
acid binding protein (NBP). Indeed, as in HTLV-I (Seiki et

Figure 3. Schematic Representation of the LAV Long Terminal Repeat
(LTR)

The LTR was reconstructed from the sequence of λJ19 by juxtaposing
the sequences adjacent to the Hind III cloning sites. Sequencing of
oligo(dT)-primed LAV cDNA clone pLAV75 (Alizon et al., 1984) rules out
the possibility of clustered Hind III sites in the R region of LAV. LTR are
limited by an inverted repeat sequence (IR). Both of the viral elements
flanking the LTR have been represented as tRNA primer binding site
(PBS) for 5' LTR and polypurine tract (PU) for 3' LTR. Also indicated
are a putative TATA box, the cap site, polyadenylation signal (AATAAA),
and polyadenylation site (CAA). The location of the open reading frame
F (648 nucleotides) is shown above the LTR scheme.

al., 1983) and RSV (Schwartz et al., 1983), the motif Cys-X2-Cys-X9-Cys
common to all NBP (Oroszlan et al., 1984)
is found duplicated (nucleotides 1509 and 1572 in the LAV sequence).
Consistent with its function, the putative NBP is
extremely basic (17% Arg + Lys).
pol 

The reverse transcriptase gene can encode a protein of up
to 1003 amino acids (calculated Mr = 113,629). Since the
first methionine codon is 92 triplets from the origin of the
open reading frame, it is possible that the protein is translated
from a spliced messenger RNA, giving a gag-pol
polyprotein precursor.

The pol coding region is the only one in which significant
homology has been found with other retroviral protein
sequences, three domains of homology being apparent.
The first is a very short region of 17 amino acids (starting
at 1856). Homologous regions are located within the p15gag
RSV protease (Dittmar and Moelling, 1978) and a polypeptide
encoded by an open reading frame located between
gag and pol of HTLV-I (Figure 5) (Schwartz et al.,
1983; Seiki et al., 1983). This first domain could thus correspond
to a conserved sequence in viral proteases; its
different locations within the three genomes may not be
significant since retroviruses, by splicing or other mechanisms,
express a gag-pol polyprotein precursor (Schwartz
et al., 1983; Seiki et al., 1983). The second and most extensive
region of homology (starting at 2048) probably
represents the core sequence of the reverse transcriptase.
Over a region of 250 amino acids, with only minimal
insertions or deletions, LAV shows 38% amino acid identity
with RSV, 25% with HTLV-I, and 21% with MoMuLV
(Shinnick et al., 1981) while HTLV-I and RSV show 38%
identity in the same region. A third homologous region is
situated at the 3' end of the pol reading frame and corresponds
to part of the pp32 peptide of RSV that has exonuclease
activity (Misra et al., 1982). Once again, there
is greater homology with the corresponding RSV sequence
than with HTLV-I.
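
Identity percentages such as those quoted above are obtained by counting matching residues in an alignment. The following Python sketch shows one way to do this, assuming the two sequences are already aligned to equal length and that columns containing a gap in either sequence are left out of the count; the text does not say how insertions and deletions were scored, so that convention is an assumption, and the toy peptides are invented.

```python
def percent_identity(a, b, gap="-"):
    """Percent of aligned, ungapped columns in which the two residues are identical."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == gap or y == gap:
            continue  # skip columns with an insertion or deletion
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Invented ten-residue peptides differing at two positions.
print(percent_identity("ACDEFGHIKL", "ACDQFGHLKL"))
```

A different gap convention (for example, counting gapped columns as mismatches) would lower the figure, so the convention needs to be fixed before percentages from different comparisons are set side by side.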
env 

The env open reading frame has a possible initiator
methionine codon very near the beginning (eighth triplet).

Figure 4. Synthesis of tRNA-Primed LAV cDNA for R+U5 (Strong-Stop
cDNA)

Lanes 1 and 2 show two different quantities of cDNA while lanes M and
M' represent markers. The strong-stop cDNA is 181 bases long with a
second, less intense band at 180. The error of estimation is ±1 bp. This
maps the major cap site to the second G residue of the sequence
CTGGGTCT within the LTR, 24 nucleotides downstream of the TATA
box. This guanosine residue is taken as the first base in the nucleotide
sequence shown in Figure 1.

If so, the molecular weight of the presumed env precursor
protein (861 amino acids, Mr calc = 97,376) is consistent
with the known size of the LAV glycoprotein (110 kd and
90 kd after glycosidase treatment; Luc Montagnier, unpublished).
There are 32 potential N-glycosylation sites (Asn-X-Ser/Thr),
which are overlined in Figure 1. An interesting
feature of env is the very high number of Trp residues at
both ends of the protein. There are three hydrophobic
regions, characteristic of the retroviral envelope proteins
(Seiki et al., 1983), corresponding to a signal peptide (encoded
by nucleotides 5815-5850 bp), a second region
(7315-7350 bp), and a transmembrane segment (7831-7896 bp).
The second hydrophobic region (7315-7350 bp)
is preceded by a stretch rich in Arg + Lys. It is possible
that this represents a site of proteolytic cleavage, which,
by analogy with other retroviral proteins, would give an external
envelope polypeptide and a membrane-associated
protein (Seiki et al., 1983; Kiyokawa et al., 1984). A striking
feature of the LAV envelope protein sequence is that the
region following the transmembrane segment is of unusual
length (150 residues). The env protein shows no




Figure 5. Location of a Short Stretch of Homology in the gag-pol Region
of the LAV, HTLV-I (Seiki et al., 1983) and RSV (Schwartz et al.,
1983) Genomes

Conserved amino acids are boxed. Homologous region is shown by
the solid bar in the schema. Each virus is organized differently in this
region but the sequence in the RSV genome maps to p15gag, which
has a protease-associated function.

homology to any sequence in protein data banks. The
small amino acid motif common to the transmembrane
proteins of all leukemogenic retroviruses (Cianciolo et al.,
1984) is not present in LAV env.
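
Potential N-glycosylation sites of the Asn-X-Ser/Thr form mentioned above can be located with a simple pattern search over a protein sequence in one-letter code. The Python sketch below is a minimal illustration; the function name and the toy peptide are assumptions and are not taken from the env sequence. It applies the motif exactly as the text states it, whereas a common refinement excludes proline at the X position.

```python
import re

# Asn-X-Ser/Thr in one-letter code. The lookahead makes the search report
# overlapping sites, which a plain pattern would skip.
SEQUON = re.compile(r"(?=N[A-Z][ST])")


def n_glycosylation_sites(protein):
    """Return the 1-based positions of the Asn residue of each Asn-X-Ser/Thr match."""
    return [m.start() + 1 for m in SEQUON.finditer(protein.upper())]


toy = "MNGTANNSTQ"  # invented peptide containing three candidate sites
print(n_glycosylation_sites(toy))
```

Running the same function over a translated env reading frame would give the list of candidate sites to compare against the 32 reported in the text.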
Q and F

The location of orf Q is without precedent in the structure
of retroviruses. Orf F is unique in that it is half-encoded
by the U3 element of the LTR. Both orfs have strong initiator
codons (Kozak, 1984) near their 5' ends and can encode
proteins of 192 amino acids (Mr calc = 22,487) and 206
amino acids (Mr calc = 23,316), respectively. Both putative
proteins are hydrophilic (pQ 49% polar, 15.1% Arg +
Lys; pF 46% polar, 11% Arg + Lys) and are therefore unlikely
to be associated directly with membrane. The function
of the putative proteins pQ and pF cannot be
predicted, as no homology was found by screening protein
sequence data banks. Between orf F and the pX protein
of HTLV-I there is no detectable homology. Furthermore,
their hydrophobicity/hydrophilicity profiles are
completely different. It is known that retroviruses can
transduce cellular genes, notably proto-oncogenes
(Weinberg, 1982). We suggest that orfs Q and F represent
exogenous genetic material and not some vestige of cellular
DNA because LAV DNA does not hybridize to the human
genome under stringent conditions (Alizon et al.,
1984), and their codon usage is comparable to that of the
gag, pol, and env genes (data not shown).
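
Composition figures such as "15.1% Arg + Lys" are residue counts expressed as a percentage of protein length. A minimal Python sketch follows; the function name and the toy peptide are illustrative assumptions. The residue set is passed in as a parameter because the text does not list which amino acids were counted as "polar".

```python
def residue_percent(protein, residues):
    """Percent of positions in the protein occupied by any of the given residues."""
    protein = protein.upper()
    if not protein:
        raise ValueError("empty protein sequence")
    wanted = set(residues.upper())
    hits = sum(1 for aa in protein if aa in wanted)
    return 100.0 * hits / len(protein)


toy = "MKRSDEKRAGTLVNQHKPWY"  # invented 20-residue peptide
print(residue_percent(toy, "RK"))  # Arg + Lys content in one-letter code
```

The same call with a different residue string gives any other composition measure, provided the residue set is stated alongside the result.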

Relationship to Other Retroviruses 

Although LAV is both morphologically and biochemically
(Barre-Sinoussi et al., 1983) distinct from HTLV-I and -II, it remained
possible that its genome was organized in a similar
manner. The characteristic features of HTLV-I and -II
genomes, which they share with the more distantly related
bovine leukemia virus (BLV) (Rice et al., 1984), are not
observed in the case of LAV. These are: a region 3' of
the envelope gene consisting of a noncoding stretch
(600-900 bp), followed by a coding sequence of 307-357
codons (X open reading frame), which may slightly overlap
the U3 region of the LTR (Seiki et al., 1983; Rice et al.,
1984; Sagata et al., 1984) and, second, the LTR being

Table 2. Comparison of the Size of the LAV LTR and LTR-Related
Elements to Those of Other Retroviruses

             LTR       U3      R      U5     PU     PBS    IR

LAV          638      456     97      85     15     LYS    4
HTLV-I       759      355    228     176     1?     PRO    4i
HTLV-II      763      314    248     261     12i    PRO    4i
MMTV       1,332    1,197     11     124     19     LYS    8i
MoMuLV       594      449     68      77     13     PRO    13
RSV          335      234     21      80     11     TRP    15
SNV          601      420     97      80     13     PRO    9

Adapted from Chen and Barker (1984).
i = imperfect match or tract.
SNV = spleen necrosis virus (Shimotohno and Temin, 1982).



composed of unusually long U5 and R elements and the
polyadenylation signal being situated in U3 instead of R
(Seiki et al., 1983; Sagata et al., 1984; Shimotohno et al.,
1984). We show here that, in contrast, the 3' end of the LAV
envelope gene overlaps an open reading frame, termed F,
that has the coding capacity for 206 amino acids and extends
within the LTR (110 amino acids are encoded by the U3
region). The putatively encoded polypeptide (pF), the primary
structure of which can be deduced, does not show
any homology with the theoretical X gene products of the
HTLV/BLV family. Also, the U5 and R elements are shorter
(Table 2) and the polyadenylation signal is located within R,
as is the case for all retroviruses except the HTLV/BLV family.
Additionally, LAV uses tRNALys as (-) strand primer, as opposed
to tRNAPro employed by all other mammalian retroviruses
except MMTV (Donehower et al., 1981). Those
homologies detected between the polymerase and protease
domains of LAV and HTLV are also found in several
retroviruses, RSV in particular.

It has been reported that a cloned HTLV-III genome
hybridizes (Tm = 28°C) to sequences in the gag-pol and
X regions of HTLV-I and -II; although restriction maps of
cloned LAV and HTLV-III show almost perfect agreement
(Hahn et al., 1984), we were unable to detect any such
hybridization between LAV and HTLV-II (Tm = 55°C)
(Alizon et al., 1984). Indeed, there is a punctual region of
homology between LAV and HTLV-I (23/27 nucleotides
starting at position 1859 in the LAV sequence) but nothing
significant between the two viruses in the X region of
HTLV-I. One possible reason for this discrepancy is that
HTLV-III is subtly different from LAV. However, it was subsequently
reported that there was very minimal, if any, homology
between orf X (of HTLV-I) and HTLV-III (Shaw et al.,
1984).

Discussion 

Regulatory sequences carried by retroviral LTR are believed
to be involved in specific interactions between the
viral genome and the host cell (Srinivasan et al., 1984).
The LTR sequences of LAV are unique among retroviruses.
That could reflect an original mode of gene expression,
possibly in relation to particular transcriptional
factors present in the virus-harboring cell. This hypothesis
can be tested by studying the regulatory activity of the LAV
LTR sequences in transient or long-term experiments involving
an indicator gene and different cellular contexts.

The presence of the Q and F reading frames in addition
to the conventional gag-pol-env set of genes is unexpected.
One should now address the question of their role
in the viral cycle and pathogenicity by trying to characterize
their protein product(s). It is tempting to speculate on
a role of such polypeptide(s) in T4 cells' mortality, a problem
that can be studied by designing synthetic peptides
for antibody production or by using site-directed mutagenesis
of Q and F coding regions.

The peculiar genetic structure of LAV poses the question
of its origin. The virus shares common traits with other
(apparently unrelated) retroviruses. For instance, the unusually
large size of the outer membrane glycoprotein
(env) and a comparably sized genome are also observed
in the case of lentiviruses such as Visna (Harris et al.,
1981; Querat et al., 1984). The presence of a large part of
the F open reading frame in the LTR, and the use of
tRNALys as a primer for (-) strand synthesis, is reminiscent
of the mouse mammary tumor virus. On the other
hand, homologies in the pol gene would suggest that the
LAV is closer to RSV than to any other retroviruses. Obviously,
no clear picture can be drawn from the DNA sequence
analysis as far as phylogeny is concerned. Thus,
it may well be that LAV defines a new group of retroviruses
that have been independently evolving for a considerable
period of time, and not simply a variant recently derived
from a characterized viral family. Both epidemiology and
pathogeny of AIDS should be reconsidered with this idea
in mind, when trying to answer such questions as these:
Are there other human or animal diseases that are associated
with similarly organized viruses? Is there a precursor
to AIDS-associated virus(es) normally present, in latent
form, in human populations? What triggered in this
case the recent spreading of pathogenic derivatives?

Experimental Procedures 

M13 Cloning and Sequencing 

Total λJ19 DNA was sonicated, treated with the Klenow fragment of
DNA polymerase plus deoxyribonucleotides (2 hr, 16°C), and fractionated
by agarose gel electrophoresis. Fragments of 300-600 bp were
excised, electroeluted, and purified by Elutip (Schleicher and Schuell)
chromatography. DNA was ethanol-precipitated using 10 µg dextran
T40 (Pharmacia) as carrier and ligated to dephosphorylated, Sma I-cleaved
M13mp8 RF DNA using T4 DNA and RNA ligases (16 hr, 16°C)
and transfected into E. coli strain TS-I. Recombinant clones were detected
by plaque hybridization using the appropriate 32P-labeled LAV
restriction fragments as probes. Single-stranded templates were prepared
from plaques exhibiting positive hybridization signals and were
sequenced by the dideoxy chain termination procedure (Sanger et al.,
1977) using α-35S-dATP (Amersham, 400 Ci/mmol) and buffer gradient
gels (Biggin et al., 1983). Sequences were compiled and analyzed
using the programs of Staden adapted by B. Caudron for the Institut
Pasteur Computer Center (Staden, 1982).

Strong-Stop cDNA 

LAV virions from infected T lymphocyte (Barre-Sinoussi et al., 1983)
culture supernatant were pelleted through a 20% sucrose cushion and
the cDNA (-) strand was synthesized as described previously (Alizon
et al., 1984) except that no exogenous primer was used. After alkaline
hydrolysis (0.3 M NaOH, 30 min, 65°C), neutralization, and phenol extraction,
the cDNA was ethanol-precipitated and loaded onto a 6%
acrylamide/8 M urea sequencing gel with sequence ladders as size
markers.

ABSTRACT

To examine the mechanism of lymphocytotoxicity induced by human T-lymphotropic
virus type III/lymphadenopathy associated virus (HTLV-III/LAV), an in vitro
model has been developed. Introduction of an HTLV-III/LAV proviral clone,
HXB2, into normal lymphocytes results in the production of virions and cell
death. The complete nucleotide sequence of the proviral form of HXB2 has now
been determined. Its structure is quite similar to that previously determined
for HTLV-III/LAV clones whose biological capacities had not previously been
demonstrated. The biological functions of two additional clones of HTLV-III/LAV,
BH10 and HXB3, are reported. Clone BH10, which lacks the 5' long terminal
repeat sequences (LTR) and a portion of the 3' LTR, is reconstituted by substituting
the corresponding sequences of HXB2 and is shown to be capable of generating
infectious cytopathic virions. Clone HXB3, which has been partially
sequenced, is also found to be capable of producing lymphocytopathic virus.
Clone HXB3 differs from HXB2 in its lack of a termination codon in 3'orf, demonstrating
that 3'orf plays no major role in virus replication or cytopathic
activity. These data provide the necessary background to allow the identification
of viral determinants of replication, cytopathic activity, and antigenicity
using these functional proviral clones.



INTRODUCTION 

AIDS is a devastating illness occurring as an epidemic with more than
20,000 cases identified thus far (1,2). It represents the most severe clinical
manifestation of infection by HTLV-III/LAV, also designated AIDS-related virus,
ARV (3) or human immunodeficiency virus, HIV (4), which is present in 1.0-2.0
million individuals in the United States alone (5,6). There is overwhelming
evidence that HTLV-III/LAV is the etiological agent in AIDS and AIDS-related
syndromes (7-17). One of the most convincing lines of evidence is the recapitulation
of T4 cell depletion in vitro as a result of HTLV-III/LAV infection
(18-20). Thus, an understanding of the mechanism(s) involved in lymphocytopathic
effects is paramount for understanding the pathogenesis of this disease.

A number of different recombinant DNA clones of HTLV-III/LAV have been
obtained and analyzed by restriction enzyme digestion and/or nucleotide sequencing
(21-32). This work has demonstrated that the viral genome is 9182-9213
nucleotides in length, with LTRs of 636-637 nucleotides, and at least
seven genes. Three replicative genes include gag, pol, and env, which are similar
to those in other retroviruses, though env is longer than that of other
retroviruses (28,33). A fourth gene, designated tat, is structurally distinct
from that of other retroviruses, and encodes a trans-acting factor capable of
enhancing virus expression in a positive feedback manner (34-40). A fifth gene
has recently been identified, and has been named art or anti-repressor of
transactivation (41) or trs or trans-repressor of splicing (42). Two additional
genes, designated short open reading frame (sor) and 3' open reading frame
(3'orf) are also unique to HTLV-III/LAV, but the functions of their gene products
are unknown (27-29,31,43-45). An additional open reading frame, designated
R, is also presumed to encode a protein product based on the finding of
antibodies in infected individuals reactive to these sequences expressed in E.
coli (our unpublished observations with J. Ghrayeb).

To define the functions of viral proteins and locate the sequences encoding
the cytopathic factor(s), an in vitro model has been established (46).
Cloned viral DNA sequences are introduced into normal lymphocytes derived from
umbilical cord blood using the protoplast fusion technique. Viral DNA, RNA,
and proteins are readily detectable after 7-24 days, as well as virions morphologically
identical to those arising from natural infection. Most notably,
cell death occurs 18-30 days after transfection. Thus, transfection of HTLV-III/LAV
proviral DNA into normal lymphocytes results in the production of lymphocytopathic
virus, reproducing the major manifestations of infection observed
both in the laboratory and in humans (18-20,46).

This experimental system allows analysis of viral sequences required for
the cytopathic activity by in vitro mutagenesis prior to transfection. To
provide the necessary background for the construction of these mutants, and to
gain further insight into the structure of active viral proteins, we have determined
the complete nucleotide sequence of the functional HTLV-III/LAV clone
HXB2. In addition, we have ligated a previously sequenced HTLV-III/LAV clone,
BH10 (28), to LTRs of HXB2 and have demonstrated that this clone also gives
rise to cytopathic virus.

MATERIALS AND METHODS



Recombinant DNA Clones 



A single T4-positive cell line, H9, was inoculated with pooled blood samples
of different patients with AIDS or related symptoms (20). Recombinant
phage clones λHXB2 and λHXB3 were derived from this infected cell line by cloning
integrated proviral copies with flanking cellular sequences in the Xba I
site of phage J1 (47). A 12.5 kilobase (kb) Hpa I - Xba I fragment of λHXB2
was blunt-ended with Klenow fragment of DNA polymerase I and cloned into the
similarly blunt-ended Bam HI to Eco RI sites of vector SP65gpt. The resultant
clone HXB2gpt2 has the HTLV-III and xanthine guanine phosphoribosyltransferase
(gpt) sequences in the same transcriptional orientation. SP65gpt was constructed
by ligating the Bam HI - Pvu II fragment of pSV2gpt (48) into the Bam
HI - Pvu II sites of SP65 (Promega Scientific). Other subclones of λHXB2 were
made in SP65 and SP62 (New England Biolabs).

λBH10 was derived from the same infected cell line by cloning an unintegrated
viral copy in the Sst I site of λgtWES·λB (25). The 8933 nucleotide
insert of λBH10 (nucleotides 222-9154, based on the numbering scheme in ref.
28) was subcloned into the Sst I site of SP64 (Promega Scientific).

Plasmid HXB3 was constructed by subcloning the 13.0 kb Xba I insert of
λHXB3 in the Xba I site of SP62. Clone HXB2/3gpt was made by replacing the 2.3
kb Xho I - Xba I fragment of HXB2gpt with the 1.2 kb Xho I - Xba I insert of
HXB3.

DNA Sequencing

The entire HXB2 proviral sequence and about 300 nucleotides of 5' and 3'
flanking cellular sequences were determined by the partial chemical cleavage
method (49) except nucleotides 3306-3739, which were determined by the dideoxy
chain termination method (50).



Assays for Cytopathic Activity 

Cytopathic activity of viruses derived from proviral DNA clones was tested
in normal umbilical cord blood mononuclear cells or a human T-lymphotropic
virus type I immortalized nonproducer cell line, ATH8, as previously described
(46,51,52). 10^10 protoplasts carrying a plasmid DNA sequence were fused using
polyethylene glycol with 2 x 10^ phytohemagglutinin (PHA) stimulated cord blood
mononuclear cells. Cultures were grown in RPMI-1640 medium (Gibco) supplemented
with 10% (v/v) fetal calf serum (FCS, Gibco), 10% (v/v) interleukin 2
(IL2, lectin-depleted; Cellular Products), 50 u/ml penicillin, and 50 µg/ml
streptomycin (Gibco). Viable cell counts were determined using the trypan blue
dye exclusion method (53) or by examination using phase contrast microscopy, at
3-5 day intervals.

Plasmids with the HTLV-III/LAV proviral sequences were also transfected
into H9 cells by the protoplast fusion method. Virus preparations were made
from 2-4 liters of cell-free supernatants of these cultures by centrifugation
at 100,000 x g for 1 hour at 4°C. The pellet was resuspended in RPMI-1640
medium and diluted to a concentration of 10* * particles/ml as determined by
electron microscopy. Concentrated virus was then added in 0.2 ml of RPMI to 2
x 10^ - 2 x 10^5 ATH8 cells pretreated for 30 minutes at 37°C with 2 µg/ml polybrene.
The virus was allowed to adsorb for 45 minutes at 37°C, and then the
cultures were diluted to 2 ml with RPMI-1640 medium supplemented with 15% (v/v)
FCS, 5% (v/v) IL2, 50 u/ml of penicillin, 50 µg/ml streptomycin, 4 mM L-glutamine,
and 50 µM beta-mercaptoethanol. Cells were grown in Falcon 3033 tubes.
At various time points, viable cell counts were determined by trypan blue dye
exclusion.



RESULTS 

The complete nucleotide sequence of clone HXB2 is shown in Figure 1. 
Seventy-nine nucleotide substitutions are noted compared to a previously se- 
quenced proviral clone, BH10 (28). Few if any of these sequence differences 
are likely to represent cloning artifacts or sequence errors since 1) these 
alterations were confirmed by DNA sequences from both strands of both clones, 
and 2) 82% of these changes are present in other previously sequenced HTLV- 
III/LAV clones (23,27-31). Of note is the lack of termination codons or frame- 
shifts within any of the previously described open reading frames. In addi- 
tion, two insertions in HXB2 relative to BH10 are in noncoding regions, and a 
single in-frame 36 bp deletion is present in the region of overlap between the 
gag and pol genes. The latter alteration represents a deletion of one of two 
copies of a perfect tandemly repeated sequence present in BH10. 
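
The in-frame character of the 36 bp deletion follows from simple arithmetic: a deletion preserves the downstream reading frame only when its length is a multiple of three. Because this holds regardless of frame, it applies to both gag and pol in their region of overlap. A minimal sketch of that check follows; the function name is illustrative only and is not taken from the original work.

```python
def deletion_effect(length_bp: int) -> tuple:
    """Return (is_in_frame, whole_codons_removed) for a deletion of length_bp."""
    if length_bp < 0:
        raise ValueError("deletion length must be non-negative")
    # A deletion keeps the reading frame only if it removes whole codons.
    codons, remainder = divmod(length_bp, 3)
    return remainder == 0, codons


in_frame, codons = deletion_effect(36)
print(in_frame, codons)  # True 12
```

A 36 bp deletion therefore removes exactly 12 codons without shifting the frame, which is consistent with the absence of frameshifts noted above.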

The functional capabilities of clone BH10 were also examined. Since this 
clone lacked the complete viral sequence, the missing portions were comple- 
mented by those obtained from clone HXB2 (Fig. 2a). The Cla I - Xho I insert 
of clone BH10 (nucleotides 374-8474, corresponding to amino acid residue 14 of 
gag to 44 of 3'orf) was inserted in place of that at the same positions of 
clone HXB2gpt2. The resultant clone HX10gpt could be distinguished from either 
BH10 or HXB2gpt2 by the presence of a Bgl II site at nucleotide position 20, a 
Sst I site at position 34, and Hind III sites at positions 78 and 9194 in 
HX10gpt and HXB2gpt2 but absent from BH10, and the presence of a Sst I site at 
position 5580 and a Hind III site at position 5607 in HXB2gpt2 but absent from 
BH10 and HX10gpt. Transfection of HX10gpt into umbilical cord blood lympho- 
cytes resulted in virions with morphology similar to those arising with infec- 
tion (Fig. 2b and 2c). 

Fig. 2 - Functional capabilities of clone BH10. a) To test the 
activity of clone BH10 a recombinant was constructed with 
HXB2gpt2. The Cla I - Xho I insert of BH10 was ligated 
into the corresponding positions of HXB2gpt2 to generate 
clone HX10gpt. Bgl II, Sst I, and Hind III sites found in 
HX10 and/or HXB2 but not BH10 are indicated. The relative 
positions of these sites in the HTLV-III/LAV genome are 
shown by the schematic above the restriction enzyme maps. 
Nucleotide positions are indicated at the top. Electron 
micrographs of b) immature and c) mature viral particles 
identified 14 and 56 days, respectively, after transfec- 
tion of umbilical cord mononuclear cells by protoplast 
fusion (46) are shown (90,000x magnification). 



Clone HXB3 was also tested for its biological activity (Fig. 3). Se- 
quences for the 3' portion of HXB3 have been determined; they differ from those 
of HXB2 between nucleotides 5323 and 9213 at 63 positions (23,54). It is nota- 
ble that HXB3 lacks a termination codon at amino acid position 124 of 3'orf 
which is found in HXB2. A hybrid clone HXB2/3gpt was also constructed (Fig. 3 
and Materials and Methods section), replacing the last 163 codons of 3'orf and 
the 3'LTR with the corresponding sequences of HXB3. Transfection of HXB3 and 
HXB2/3gpt into cord blood mononuclear cells or H9 cells also gave rise to in- 
fectious virions with characteristic HTLV-III/LAV morphology. 

Fig. 3 - Functional HTLV-III/LAV DNA clones with or without a 
termination codon in 3'orf. The structure of clone HXB2, 
which includes a termination codon in 3'orf, and clones 
HXB3 and HXB2/3, which lack this termination codon, are 
shown. The positions of Hind III sites which distinguish 
the clones from one another are shown. In addition, the 
positions of Xho I and Xba I sites used for constructing 
the hybrid clone HXB2/3 are shown. Plasmid constructions 
are described in the Materials and Methods section. 

Assays of the cytopathic abilities of virus produced by these DNA clones 
were then carried out in umbilical cord blood mononuclear cell cultures and the 
ATH8 cell line. Transfection into umbilical cord blood mononuclear cells of 
SP65gpt, which lacks HTLV-III/LAV sequences, resulted in growth of the cell 
culture at a rate similar to that of an untransfected culture (Fig. 4 and data 
not shown, ref. 46). Transfection of HXB2gpt2 resulted in the production of 
viral DNA, gag proteins, and viral particles with a morphology typical for that 
of HTLV-III/LAV, and an accelerated rate of cell death (Fig. 4). Transfection 
of HXB3 produced a rate of cell killing similar to that of HXB2gpt2. Introduction 
of HX10gpt into cord blood mononuclear cells showed an attenuated rate of cell 
killing compared to HXB2gpt2 and HXB3, though it was reproducibly greater than 
that of SP65gpt. The diminished rate of cell death of cultures transfected 
with HX10gpt correlated with the delay in the appearance of viral particles 
compared to cultures transfected with HXB2gpt (Fig. 2, ref. 46, and data not 
shown). 

Virus was prepared from the cell-free supernatant fluid of H9 cells either 
infected with an HTLV-III/LAV preparation (HTLV-IIIB) or transfected with plas- 
mids HXB2gpt2, HX10gpt, or HXB3. These samples were then diluted by particle 
counts to give multiplicities of infection (moi) of 50-3000 virions per cell 
(Fig. 5). In each of three separate experiments, a sample with no virus was 
tested as well as the same reference virus preparation, HTLV-IIIB, at varying 
moi. These data reveal that viruses derived from DNA clones HXB2gpt2, HX10gpt, 
and HXB3 all produce cytopathic effects on human T cells. 

DISCUSSION 

In order to clarify relationships between structure and function of viral 
proteins encoded by the HTLV-III/LAV genome, the complete nucleotide sequence 
of the functional clone HXB2 has been determined. In addition, the previously 
sequenced clone BH10 (28), when ligated to LTRs of HXB2, is also shown to be 
functional. The sequence of HXB2 differs from that of BH10 in only 79 nucleo- 
tides. Furthermore, insertions of 2 and 3 nucleotides, respectively, are found 
in noncoding regions. 




[Figure 4. X-axis: Time after Protoplast Fusion (days)] 

Fig. 4 - Cytopathic effects of functional HTLV-III/LAV DNA clones 
transfected into umbilical cord blood mononuclear cells. 
Plasmid clones SP65gpt, HX10gpt, HXB2gpt, and HXB3 were 
transfected into cord blood mononuclear cells by proto- 
plast fusion as described in the Materials and Methods 
section. The number of viable cells in the cultures was 
determined over a period of 28 days following transfec- 
tion. 

[Figure 5. Axis label: Exposure to Virus] 

Fig. 5 - Cytopathic effects of virus derived from HTLV-III/LAV DNA 
clones towards ATH8 cells. Samples of virus were prepared 
by transfection of H9 cells with DNA clones HXB2gpt, 
HX10gpt, or HXB3 as described in the Materials and Methods 
section. Different amounts of virus were added to ATH8 
cells, and the number of viable cells determined over a 
period of 10 days after infection. Panels a), b), and c) 
each represent separate experiments. The same stock virus 
preparation, HTLV-IIIB (20), was used as a positive con- 
trol in each case. 

One notable feature of clone HXB2 is the loss of one copy of a tandemly 
repeated 36 bp sequence in the overlap between the gag and the pol genes, 
whereas most other sequenced clones, including BH10, have two copies of this 
sequence. This pattern of insertion or deletion of perfect or imperfect re- 
peated DNA sequences is a common mode of variation of HTLV-III/LAV sequences, 
most likely occurring as a result of jumps by the reverse transcriptase during 
synthesis of DNA intermediates. Furthermore, the presence of one or two copies 
of this 36 bp sequence does not have a major influence over the rate of virus 
replication, and, thus, it does not perturb frameshifting which most likely 
occurs near this region in the synthesis of the gag-pol precursor protein. 

The similarity of the structure of the functional clone HXB2 to that of 
BH10, and the demonstration that clone BH10 is functional, reaffirm interpre- 
tations based on the BH10 sequence data (28) and subsequent analysis of cDNA 
clones (27,35). The multiple open reading frames identified in the viral genomes 
of previously sequenced clones are also found in the functional clone HXB2 and 
are similar in size and position. Furthermore, no additional open reading frames 
are found in HXB2 which are absent from the other DNA clones. The recent re- 
port that the previously sequenced clone ARV-2 is also functional provides 
further support for these conclusions (55). 

The identification of additional functional clones of HTLV-III/LAV provides 
clues for unravelling the biochemical basis of virus replication and cell kill- 
ing. Clone HX10gpt revealed attenuated cytopathic effects after transfection 
into cord blood mononuclear cells, which correlated with a delay in the appear- 
ance of viral particles. However, infection with virus derived from this plas- 
mid clone revealed substantial cytopathic effects. Thus, the results in cord 
blood mononuclear cells are most likely due to either a lower transfection 
efficiency achievable with this clone, or mildly reduced infectivity of virus 
obtained from this clone in cord blood mononuclear cells, rather than a true 
reduction in the cytopathic potential of this genome per se. 

Clone HXB3 also gives rise to cytopathic virus after transfection into 
cord blood mononuclear cells. The kinetics of virus production and cell kill- 
ing are comparable to those of cultures transfected with HXB2gpt2. Virus de- 
rived from HXB3 also showed substantial cytopathic effects on ATH8 cells. 
Though only a portion of the sequence of HXB3 has been determined, it appears 
to be very similar to that of HXB2 (23,54). However, clone HXB2 has a termina- 
tion codon at amino acid codon 124 in the 206 codon 3'orf gene, whereas clone 
HXB3 has a tryptophan codon at this position (54). The normal 3'orf product 
has been shown to be 27 kilodaltons (kd), whereas that generated from HXB2 is 
truncated and is 13 kd (42). This suggests that functions encoded by the sec- 
ond half of the 3'orf gene are nonessential. This is confirmed by the demon- 
stration of similar functional activity of clone HXB2/3gpt in which only amino 
acids 43-206 of HXB2 3'orf are replaced by those derived from HXB3. Recent 
data have also demonstrated that deletions and frame shifts between amino acid 
codons 22 and 58 of 3'orf also do not affect the functional capabilities of the 
DNA clone (56). Thus, 3'orf is not required for in vitro replication or cyto- 
pathic activity. 
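
The size of the truncated 3'orf product is roughly what the codon positions predict. The back-of-the-envelope sketch below assumes an average residue mass of about 110 daltons (a common rule of thumb, not a value from this paper) and assumes that a termination codon at position 124 yields a 123-residue product. The helper name is hypothetical.

```python
AVG_RESIDUE_DA = 110  # assumed average mass per amino acid residue, in daltons


def approx_mass_kd(residues: int) -> float:
    """Rough protein mass in kilodaltons from residue count (rule of thumb)."""
    return residues * AVG_RESIDUE_DA / 1000


stop_codon_position = 124                   # termination codon in HXB2 3'orf
truncated_residues = stop_codon_position - 1  # residues translated before the stop
full_length_residues = 206                  # codons in the intact 3'orf gene

print(round(approx_mass_kd(truncated_residues), 1))    # 13.5
print(round(approx_mass_kd(full_length_residues), 1))  # 22.7
```

The truncated estimate of about 13.5 kd agrees with the reported 13 kd product. The full-length estimate falls below the reported 27 kd, so the rule of thumb should be read only as an approximation; apparent sizes measured on gels often differ from sequence-based estimates.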

The establishment of an in vitro system for AIDS, the identification of 
functional clones of HTLV-III/LAV, and the determination of the complete nucle- 
otide sequence of functional clones provide the necessary tools and information 
for dissection of the viral genome to uncover the determinants of cytopathic 
activity and virus replication. By manipulating the genome of HXB2 to produce 
mutations in the virus, and analyzing the effects of such alterations in our in 
vitro system, we have recently mapped a major determinant of the cytopathic 
activity of HTLV-III/LAV to the 3' region of the virus (56). The use of molecu- 
lar clones of AIDS virus as shown here will also provide homogeneous stocks of 
virus useful for analysis of viral targets of cellular and humoral immunity. 